Get list of worksheets on the command line

April 14th, 2016

Today I need to get a list of worksheets from a very large spreadsheet (~100 worksheets, many thousands of cells on each) quickly from the command line. My ruby script tried to parse the entire workbook and exploded all my RAM. Then I discovered grep -0 which prints matching occurrences only (each on a separate line) which is very useful with regular expression patterns. The “sheet name” entries are in the “workbook.xml” zipped up in the Excel file.

$ unzip -c your-large-spreadsheet.xlsx xl/workbook.xml | grep -o sheet\ name=\"[^\"]*\" | cut -d = -f 2

Getting round the limitations of Excel VBA’s UsedRange function

April 5th, 2016

Turns out that the Excel VBA function UsedRange doesn’t always give the expected results. It gives you the range of cells that have ever been used. If you’ve formatted them, or they used to have values in that you’ve since deleted, you still get those cells. Today I wanted to iterate all cells in a lot of sheets. Some of those sheets had lots of cells in their UsedRange. I hit upon a solution I’ve not seen anywhere else: take the union of constants and formulas in a sheet. This  is quick, because you can use SpecialCells to do the heavy lifting. And you can refine the results to exclude errors using SpecialCells Value argument (the default 23 below selects everything). Here’s the code

Function myUsedRange(ws As Worksheet, Optional lookfor As Long = 23) As Range
    Dim c As Range, f As Range
    On Error Resume Next
    With ws.Range("A1")
        Set c = .SpecialCells(xlCellTypeConstants, lookfor)
        Set f = .SpecialCells(xlCellTypeFormulas, lookfor)
        If c Is Nothing Then Set c = .Cells
        If f Is Nothing Then Set f = .Cells
    End With
    Set myUsedRange = Union(c, f)
End Function

This returns a non-contiguous range containing only the cells you want. This version is a bit lazy and includes [A1] if there are no constants or no formulas matching your search. I didn’t mind because I was using lookfor=2 to search for text which reduced the number of cells by thousands. Hundreds of thousands in some cases. A better version would take a default as a parameter and return it only if neither  constants or formulas gave any results.

pip install rx

March 22nd, 2016

I just installed RxPy the Reactive extensions for Python, as you may guess from the title of this article. Well, that was easy. What next? Here’s a delightfully short program which does — ahem — something:

import sys
import rx

rx.Observable.from_(sys.stdin).subscribe(print)

It echoes stdin to stdout. Kinda.

  • There are more newlines than I was expecting — an extra one after each line is echoed
  • The single line is actually an expression returning a Disposable that you can use to unsubscribe. I’m discarding it.
  • I have syntastic installed in Vim and pylint complains of a syntax error when I pass print to subscribe so perhaps I need to configure it to know about python3.

Excel excess: how to remove unwanted styles with VBA

March 11th, 2016

Long story short, I just opened a workbook with 37,000 styles :(. Apparently this is a common problem). Allen Wyatt of ExcelTips suggests deleting non-builtin styles. But I had over 6,000 of those. And some of my user-defined styles are important!

My solution is to remove any style whose name ends with a number, keeping one of each base-name. Cells with deleted styles get “Normal”.

The code uses regular expressions to trim trailing numbers from style names, and uses a dictionary for the set of base names.

Option Explicit

Sub DeduplicateStyles()
  Call DeleteDupStyles(Workbooks("NEW_multi_sector.xlsx"))
End Sub

Sub DeleteDupStyles(wb As Workbook)
  Dim sty As style
  Dim intRet As Integer
  Dim dict As New Scripting.Dictionary
  Dim count As Integer: count = 0

  For Each sty In wb.styles
    Dim n As String: n = trimTrailingNumbers(sty.Name)
    If dict.Exists(n) Then
      ' Debug.Print ("deleting: " & sty.Name)
      On Error Resume Next
      sty.Delete
      If Err.Number <> 0 Then
        Debug.Print ("ERROR deleting: " & sty.Name)
      End If
    Else
      Debug.Print ("keeping: " & sty.Name)
      dict.Add n, 1
    End If

    count = count + 1
    If count Mod 100 = 0 Then
      DoEvents
    End If
  Next sty
End Sub

Function regexp(pattern As String) As regexp
  Dim rx As New regexp
  With rx
    .Global = True
    .MultiLine = False
    .IgnoreCase = False
    .pattern = pattern
  End With
  Set regexp = rx
End Function

Function trimTrailingNumbers(s As String) As String
  trimTrailingNumbers = regexp("( \d+)+$").Replace(s, "")
End Function

Keeping named files in bash

February 1st, 2016

It just took me way too long to solve this, so I’m going to write it up quickly:

The Problem

  • I have a file containing the basenames of spreadsheets that I want to keep in a directory structure. I want to delete all the other spreadsheets.
  • There are levels of nested directories and the files are distributed among them
  • The file names, and the directory names contain spaces (don’t look at me like that, it’s other people’s work on Windoze — at least I’m using bash
  • I’m using bash on Linux, so I have proper find, etc.

A Solution

#!/bin/bash
modelfiles=../files-to-keep.txt

find . -name \*.xls\?  | while read  f
do
    if ! grep -q "${f##*/}" $modelfiles
    then
        rm "$f"
    fi
done

I’m sure there are more efficient ways. Contact me if you have ideas. Meanwhile I’ve got to get back to work so I’m sticking with “it works”.

UPDATE: that didn’t work when I tried to re-run it just now. So switched from grep -v to if ! …

Waiting on the platform

October 10th, 2015

A recent conversation with Lauren Serota reminded me of a challenge we often face building small-to-medium sized digital solutions for our clients: the right platform.

We were talking about building custom phone apps in the context of Myanmar’s smartphone revolution, and the challenge of finding the right partners to implement innovative mobile service delivery systems. She told a story of one such development which looked great but cost too much to make small changes to. That entailed going through another development cycle with the software company; for changes that felt like they should be under the service team’s control.
“They wrote a custom native app.”, she said, “instead of building a platform.”
This started me revisiting similar stories from Aptivate, where we were the software company.

The IT Specialists

I’m thinking about AuthorAID, which helps support researchers from developing countries preparing academic articles for publication in peer-reviewed journals. It is a fairly complex web application which functions not entirely unlike a dating site (matching mentors with mentees) with a way to track collaboration through research publishing tasks. We built it with Django, a powerful framework rapid web development.

Django was a great choice, it enabled us to build AuthorAID at a cost within INASP‘s budget; an impressive achievement because AuthorAID is essentially a fairly complex, bespoke web application serving a relatively small audience. That’s why it’s our web platform of choice. But neither AuthorAID, nor Django are the platform I think Lauren was talking about.

As we maintain and develop AuthorAID for INASP, some of the changes we make, like adding faceted search, feel like they belong squarely with the “IT specialists”. Others, like adding fields to the form that users fill when they register, feel like they belong with the team running the site.

A lot of Django’s power for rapidly building web applications comes from its focus on meta-data. To add those new fields to the sign-up form, we edit the Model file which is written in Python, the programing language Django is written in. So, you have to be a programmer to change them. But once that’s done, changes to web-forms can be generated automatically from the models. The new field values will need to be stored in Django’s relational database, but the necessary changes to the database schema can also be generated from the models, and theres a whole system to manage applying those database changes. Django is a great tool for the “IT specialists”. Which makes sense: it was created by web developers at a Newspaper.

This works out well for INASP who retain a relationship with Aptivate that enables us to act as their “IT specialists”, for the website stuff. But some of the web applications we build aren’t commissioned by organisations like INASP, but as part of change programmes with a limited lifetime and budget. They can’t afford to keep us on as IT specialists, in case their web application needs to be tweaked or extended. If they want to change the data they gather or report on, they need to be able to make those changes themselves. They don’t want a bespoke application built with a rapid-development framework; I think they need the “Platform”.

The General Solution

I’m thinking now of another project that, in the end, we didn’t do. We were written into the budget proposal for a three year programme with institutional funding, to provide a database for monitoring and evaluation (M&E) data. Budget was small and the requirements weren’t very clear until the programme started and a team assembled. We did our discovery process soon found out that our clients had high hopes for their M&E database, but they could predict neither the structure of the data they would be storing, nor the work-flows they would want it to support.

Trying to manage the risk of Building The Wrong Thing, we pushed for specifics specific examples.
“It doesn’t have to be an all singing, all dancing system.”, they told us in one meeting.
“Okay”, I replied, “but you have to tell us exactly which songs, and which dances you will want it to do.
Eventually they told us not to keep trying to understand their specific requirements, and “just deliver the generic version”.

It’s much harder (read more expensive) to build a generic tool than a specific one. It’s very hard to get the abstractions right, especially without passing through specific examples on the way. But our client had had something specific in mind: a low-cost, generic solution. I think I understand where they were coming from: Excel. The generic solution, for fifty bucks, built on billions of dollars of R&D and decades of getting the abstractions right.

Most people have never bought bespoke software, and don’t know how to. If we were one of those people, the closest thing we can imagine to bespoke software might be a customised spreadsheet; with some protected cells, nice colouring, some predefined formulae and some VBA macros behind the scenes. I’m actually working on one of these at the moment, for another client. If I don’t do all the customisation work, what will be left? The full awesome power of Excel. Users would have to work harder to use it but, with the right skills, they could do so. Excel is a cheap product and all that customization effort is expensive IT specialist consultancy. I understand it must feel frustrating When the IT specialists start asking ‘which songs and which dances?’. Just give us The Thing. The thing we can live in and work it out for ourselves. The web-based Excel* for data-driven applications.

* I know this sounds like Google sheets, but they weren’t actually looking for an on-line spreadsheet.

Compared to evolving a customised spreadsheet, building a bespoke web application in, say, Django is a completely different proposition. It takes significant work to get to an MVP and when you get there the only data and workflow it supports are those the product designers have chosen to include. There are no blank workbooks to fall back on.

I’m wondering if there is something awesome we should be using as a platform for web applications. And if there isn’t shouldn’t we be designing it?

Deadlock: BDD testing a python Tornado app with py.test and Splinter

September 23rd, 2015

I’m only a day into my new asynchronous web development project in Tornado, and I already have a deadlock. And this is remarkable, I think, because all I have done so far is a copy-and-paste the Hello World example app from the Tornado website. An example which “does not use any of Tornado’s asynchronous features”. So, how did I manage to get deadlock?

Well, on top of Hello World, I have a test. A functional test, written with pytest-bdd that uses pytest-splinter makes sure here is some sort of web server up and running. Now pytest-splinter provides a browser object as a py.test fixture. Tests can use this to access the site, so
long as a server is running. Since I’m shopping for py.test plugins, I added pytest-tornado whose http_server fixture starts a tornado HTTP server, serving the application I provide by defining an app fixture.

All set to go? I thought I was. But I got the big crunch. My splinter browser starts but no page loads. What’s up? Turns out the tests aren’t asynchronous. The splinter driver is blocking and the asynchronous web server never gets a look-in to serve the content. Pytest-tornado wants me
to write tests using Tornado’s asynchronous http client, so it and the server can play nicely together.

That’s all fair, but I want to use Splinter on this project so I’m going to want to start my server in a thread. I created an apprunner module that can start my Tornado app in a thread (and return the url to access it). And also provides a function to instruct the server to stop, and exit the
thread:

import threading
import tornado
from .application import Application


ioloop = tornado.ioloop.IOLoop.instance()


def start(TESTING_PORT=8192):
    """ Start Application runing on given port. Retrn the listening url"""
    global thread
    app = Application()
    app.listen(TESTING_PORT)
    thread = threading.Thread(target=ioloop.start)
    thread.start()
    return "http://localhost:{}".format(TESTING_PORT)


def stop():
    """ Ask IOLoop to stop running, and wait for its thread to complete"""
    global thread
    ioloop.add_callback(ioloop.stop)
    thread.join()

Then I define my own fixture to start and stop the server using pytest
yield_fixture:

@pytest.yield_fixture
def app():
    yield start()
    stop()

And now I can write lovely clean BDD steps using splinter browser and my
own app running in a thread, like this:

@when("I visit the site")
def visit_site(browser, app):
    browser.visit(app)

What do chameleons eat?

June 28th, 2015

When this question comes up in a children’s book these days, what do the characters do?

20150627_073158

Zoe (and Beans) rushes inside to Google it.

Which brings me to Mr Christmas and his flying teapot:20150627_073030

All you do is refuel every thousand miles, and you can fly up to three times the speed of sound!

Nic was sceptical.   Google couldn’t give the answer, but Wolfram Alpha did.

At three times the speed of sound, he’s going to have to refuel (with teabags) every 26 minutes and three seconds. For a trip round the world’s circumference that’s nearly eleven hours travel not counting time taken to deliver presents. On top of that he has to stop at least every half hour to brew up. Now add in loo breaks… he’s not going to make it.

Exaile as a DJ workstation

May 6th, 2015

Turning This…

Exaile out the box

Exaile, out of the box, is a smart looking multi-platform music player. The Exaile website lists “Simple interface” as its number-one feature, and mentions that it has lots of plugins. Exaile should be the first choice for Swing-dance DJs, and anyone else who wants to build play-lists on the fly.

Into This…

Screenshot from 2015-04-01 21:15:04

Exaile does stuff other music players haven’t even heard of. To get the most out of it you have to enable and configure some of its plugins. Here’s how I set up mine:
Read the rest of this entry »

Loss of language

May 30th, 2014

brang01 brang02brang03 brang04 brang05 brang06