24 March 2013

One of my jobs at the Census Bureau is to help maintain programs used for record linkage. These programs are written in C and Fortran, and they’re optimized to work on Census-sized data. They’re not, however, optimized for ease of use, and this is evident in the configuration files required to run the programs. I wanted to see how hard it would be to replace these configuration files with ones written in lua or scheme and designed to be human-friendly. Here’s how that went.

I put the results of this experiment in a github repo so you can follow along. The starting point is params.txt, which is a pastiche of the real configuration files that I work with. I describe params.txt in the README, and it might be helpful to read it to judge whether I’ve done a good job with my domain-specific languages (DSLs).

Fair warning: this is a brain dump and not a tutorial.

The lua version

This is my first time using lua for anything, but I’ve noticed for a while that when people talk about lua they’re almost always praising it. And recently I started using premake, which gave me a great example of how good a lua-based config file could look.

My results are in the lua/ folder. The first thing I want to point out in params.lua versus params.txt is that I’ve separated the description of the file layouts from the matching parameters by creating the separate “fields” and “strategy” sections. In addition to shortening the inscrutable lists of numbers that appear in the file, this means I never have to specify a field location more than once.

About those inscrutable lists, I did consider using keywords

{name="first", length=20, starta=0, startb=0}

instead of the less clear

{"first", 20, 0, 0}

The problem was that when defining several fields in a row, the keywords become redundant line noise. I like

{"first", 20, 0, 0},
{"last", 20, 20, 21},
{"mi", 1, 40, 20}

better than

{name="first", length=20, starta=0, startb=0},
{name="last", length=20, starta=20, startb=21},
{name="mi", length=1, starta=40, startb=20}

In the “fields” section I considered using the field name as a keyword, so a user would specify

first = {20, 0, 0}

instead of

{"first", 20, 0, 0}

but I didn’t like that field names would appear with quotes in one part of the file and without them in another.
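To make the point of the “fields”/“strategy” split concrete, here is a minimal C sketch of what the host program might hold after reading the fields section. It is not the code from the repo: the struct field type, the fields array, and the find_field helper are hypothetical names of mine, and I’m assuming the four positions mean name, length, start in file A, and start in file B, as in the keyword form above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C-side record for one entry in the "fields" section:
   name, length, and start offset in file A and in file B. */
struct field {
    const char *name;
    int length;
    int start_a;
    int start_b;
};

/* The three example fields from the snippets above. */
const struct field fields[] = {
    {"first", 20,  0,  0},
    {"last",  20, 20, 21},
    {"mi",     1, 40, 20},
};

/* Look up a field by name; returns NULL if no field has that name.
   A strategy entry would carry only the name and use this to find
   the layout, so offsets are written down exactly once. */
const struct field *find_field(const struct field *fs, size_t n,
                               const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(fs[i].name, name) == 0)
            return &fs[i];
    }
    return NULL;
}
```

With something like this, a strategy that mentions "last" resolves to the same offsets everywhere, and a misspelled field name shows up as a NULL lookup rather than a silently wrong column.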

On the C side, I was surprised that the official lua documentation didn’t give an example of a simple C program that evaluated a lua script and processed the results. I had to piece this together from various tutorials on the web. This isn’t a tragedy, per se, but it means I’m not clear on things like the difference between lua.h and lualib.h (I gathered that lauxlib.h is for frequently useful but non-basic functions). It also means I spent an hour pulling my hair out trying to diagnose a segfault before I figured out that I needed to call luaL_openlibs(L) to make the pairs function visible.

Other than that, the stack interface was straightforward enough. My feeling is that for simple operations it’s more concise than the guile interface (fewer type conversions), but for more complex operations it’s more verbose.

The scheme version

The scheme version of my configuration parser is in the guile/ folder and the final configuration file is in params.scm. I’ve previously dabbled with a scheme-based config file in my banmi project, but for that project I wrote bindings so that I could call C functions from scheme (as opposed to simply processing scheme source files in C). I used the guile implementation of scheme because I like its C interface, but I may write a second version that statically links chibi scheme because I have to work on servers that don’t have guile installed.

As all lispers know, macros really shine at this sort of thing, and I’m slowly getting the hang of syntax-rules. Using macros, I can make it so the user doesn’t have to put quotes around everything. I can write a flags macro so that the user can specify

(flags verbose no-summary)

instead of

(flags verbose: true summary: false)

or something similar. With this implementation, my only hang-up on the C side was trying to figure out whether I wanted scm_init_guile, scm_boot_guile, or scm_with_guile. As far as I can tell, the latter is the correct choice for this application: starting an interpreter to process a couple files and then dropping back to plain old C. I spent most of my time rewriting (and debugging) macro definitions because I thought I could make them more elegant or concise.
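For readers who don’t speak scheme, the convention behind the flags form (a bare name switches a flag on, a no- prefix switches it off) can be sketched in C. This is only an illustration of the convention, not the syntax-rules macro from the repo, and parse_flag is a hypothetical helper name.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: interpret one token from a (flags ...) form.
     "verbose"    -> name "verbose", value true
     "no-summary" -> name "summary", value false
   Returns a pointer into the token where the flag name starts. */
const char *parse_flag(const char *token, bool *value)
{
    static const char prefix[] = "no-";
    size_t plen = sizeof prefix - 1;

    /* Treat "no-" as a negation only when a name follows it. */
    if (strncmp(token, prefix, plen) == 0 && token[plen] != '\0') {
        *value = false;
        return token + plen;
    }
    *value = true;
    return token;
}
```

The appeal of doing this in a macro instead is that the rewriting happens inside the configuration language, so the C side only ever sees name/value pairs.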

Additional thoughts

Compared to the scheme version, the lua config file requires a lot of commas and quotation marks. Even so, I imagined that the lua version might be easier to read for people with some programming skills, probably because of all the parenthesis-hate heaped on lisps over the years. But for the record, my non-programmer wife preferred the lispy version.

One thing I want to mention is that the scripts I wrote to define the configuration DSLs were bare-bones and, in particular, made no attempt to verify correctness or produce meaningful error messages. Once I commit to one of these languages, I can imagine building up a library to help me configure my configurations.

It bothers me a little that in order to use either of these approaches, I have to distribute script files alongside my executable. At least that’s the easy way. Looking at premake again, I see that another approach is to embed the script source into a char array in a C source file. Interesting.
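A minimal sketch of the char-array idea, assuming a build step (not shown) that turns the script file into a C source file. The names embedded_script and embedded_script_len are hypothetical, and the script text is a made-up fragment in the style of the field lists above rather than the DSL script from the repo.

```c
#include <stddef.h>

/* Hypothetical embedded copy of a small lua script. In practice a build
   step would generate this array from the script file, so the two
   never drift apart. */
const char embedded_script[] =
    "fields = {\n"
    "  {\"first\", 20, 0, 0},\n"
    "  {\"last\", 20, 20, 21},\n"
    "  {\"mi\", 1, 40, 20}\n"
    "}\n";

/* Length of the script text, excluding the terminating NUL byte. */
const size_t embedded_script_len = sizeof embedded_script - 1;
```

On the lua side, a buffer and length like these could be handed to luaL_loadbuffer instead of loading a file by name, so the executable would no longer depend on a script sitting next to it.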