PROGRAMMING and BIOINFORMATICS: Literate and test-driven development, especially for data-processing projects.
One of my concerns about bioinformatics (or computational biology) is the quality of the software efforts. As encouragement for thought on this, I’ve outlined below an approach I take for many of my projects.
When you’re running bulk data processing, it’s pretty hard to know if the code is really doing what it’s supposed to. It ends up being pretty much a black box, in some cases even for the person who wrote the code.
One informal verification approach is to take some input data you are very familiar with and explore the output you get. While I recommend this, it isn’t terribly strong testing. You’re unlikely to put a lot of effort into it (let’s face it, we all only have so much time). If you’re the developer, you’re also likely to focus on the same things that occurred to you whilst developing the code, meaning you’re unlikely to test for ‘unexpected’ things, which is… after all… the whole point of testing.
What I’m leading to is more formalised approaches to testing code, regression tests in particular. I’m especially curious as to how bioinformatics developers are developing their code. I’d like to think, or at least wish, that most have formal testing as part of their development process, but then again you don’t see remarks about how the software was tested in most bioinformatics publications, so how are we to know? (Were I the editor of a bioinformatics journal, I’d consider making it a requirement that the testing regime be explicitly discussed. Maybe that’d make me an overly harsh editor – ?!)
For some projects my approach is a mix of sort-of literate programming and test-driven development.
Most programmers will have heard of these terms, so I’ll try not to belabour explaining them. (Those who’d rather cut to the chase can skip down the page to Software development in bioinformatics.)
An important aspect of both is that they tie in with how you write the code: the work happens not after the event (coding) but during it, in fact before coding.
Literate programming, simplistically put, is based around the idea that you write a description of the purpose of the code first, then write the code against that.
Test-driven development, done strictly, involves writing the tests first, then developing the code against them.
More on these later. Here’s a sketch outline of an approach I often use. (My emphasis here is not the particular tools, but the overall approach and the implications.)
I start by aiming to have an overall design featuring a small main program with trivial logic, with the bulk of the work in libraries. Essentially the main program is to be just a skeleton that calls up the real work. This is partly because the testing approach I use is most readily applied to invoking routines in libraries, but also because it encourages developing code in a form that can be re-used later. Of course, it also tends to make the top-level code simple.
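As a rough illustration of that layout, here is a minimal sketch in Python. The names `gc_fraction`, `summarise` and `main` are made up for this example and are not from any real project; the point is only that the main program does next to nothing itself.

```python
# Hypothetical example: all names here are illustrative assumptions.

# Library part: the real work, in small routines that can be tested on their own.

def gc_fraction(sequence):
    """Return the fraction of G and C bases in a DNA sequence (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def summarise(records):
    """Turn (name, sequence) pairs into (name, GC fraction) pairs."""
    return [(name, gc_fraction(seq)) for name, seq in records]


# Main program part: a skeleton with trivial logic that calls up the real work.

def main(records):
    """Print one tab-separated line per record and return an exit status."""
    for name, fraction in summarise(records):
        print(f"{name}\t{fraction:.2f}")
    return 0
```

Because `gc_fraction` and `summarise` take plain values and return plain values, a test can call them directly without running the whole program.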
Writing about it
For each routine or method, I first write, within the source code, documentation describing what it does, or more accurately, what it is to do (as I do this before writing the code). These are not informal comments but annotated ones, indicating the name of the routine (or method), the parameters, the result, a synopsis, description, and so on.
(I’m going to use ‘routines’ from here on. Mentally substitute ‘method’ or ‘object’ as suits yourself.)
The descriptions I write are fairly extensive and can be collated and presented as a website that summarises the routines in a library, somewhat similar to the commercial summaries of ‘standard’ class libraries, etc.
For those who are curious, my code documentation comments look something like the example below, which shows only a subset of the features:
# @sub GHJ_DT_Add_Column
# @abstract Add a column to a GHJ_DT object given a field name or column number
# Modelled on GHJ_DTGetCol(), this routine inserts a column into the DT object passed.
# The DT object may be undefined, empty or hold contents.
# The position the column is to be inserted into the DT object can be specified; the
# default is to append it as the last column.
# Callers pass either the text naming the field (column) or a number, which is taken to
# represent a column number (counted from zero) . . .
# @parameter $target_DT_Ref Reference to GHJ_DT (data table) object
# @parameter $target_Field_Name Name of field (column)
# @parameter $source_DT_Ref Ref. to a GHJ_DT holding the field column to add
# @parameter $source_Field_Name Name of field (column)
# @parameter $placement [optional] < 0 => before target column; >= 0 => after
# @result undef if an error occurs, otherwise the new number of columns in the DT (>0)
(This example is for Perl, if anyone is wondering. The same scheme can be applied to other programming languages. There are other types of documentation blocks for the overall project, such as to-do lists, history, etc., so that the whole source code is in effect self-documenting. Yes, Perl mongers, I’m aware of POD – the scheme I’m using is portable across several languages and besides I’ve become accustomed to it!)
Code for the routine is then written against the description of what it is to do. Writing the description first helps you think through what coding is needed and what issues might arise, hopefully heading off the worst of them before you start coding. To me it’s important to start with a clear, precise idea of what a routine is to achieve.
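To show the description-first idea in runnable form, here is a small sketch in Python. It is not the GHJ_DT routine from the comment block above: `add_column`, the dict-based table layout and the simplified `position` parameter are all assumptions for illustration. The docstring borrows the annotated-field style and is written first; the body is then written to do what it says.

```python
def add_column(table, field_name, values, position=None):
    """
    @abstract   Add a named column to a simple table.
    @parameter  table       dict with "fields" (list of names) and "rows" (list of lists)
    @parameter  field_name  name of the new column; must not already exist
    @parameter  values      one value per existing row
    @parameter  position    [optional] column index to insert at (counted from zero);
                            the default is to append as the last column
    @result     None if an error occurs (table left unchanged), otherwise the new
                number of columns (> 0)
    """
    fields, rows = table["fields"], table["rows"]
    # Check everything the description promises before changing anything.
    if field_name in fields or len(values) != len(rows):
        return None
    if position is None:
        position = len(fields)
    if not 0 <= position <= len(fields):
        return None
    fields.insert(position, field_name)
    for row, value in zip(rows, values):
        row.insert(position, value)
    return len(fields)
```

Each clause of the description (duplicate names, mismatched lengths, the default placement) becomes something the code must handle and, later, something a test can check.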
It’s more documentation than many might do. (Maybe it shows I’m a writer at heart?)
I find this approach also aids code re-usability and in time spares a certain amount of re-inventing old code. (Poorly documented code is more likely to be re-invented.)
In addition to describing the purpose of each routine, I test each routine as it is developed.
Each library has a regression file: a collection of tests that exercise each of the routines in the library.
Tests for each routine are written before or as the code for the routine is written, with the coding and testing for that one routine done hand-in-hand. (The idealised version is to write the tests first, but in many cases it’s more practical to develop both more-or-less concurrently.) The main thing is that testing is tightly tied to the development of each routine, at the time the routine is developed, rather than left as something to be done later. Every routine added to the libraries is tested before it is incorporated into the application code.
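A regression file along these lines might look like the sketch below. It uses Python’s standard `unittest` module; the routine `reverse_complement` and the helper `run_regression` are invented for illustration, and a real regression file would cover every routine in the library rather than one.

```python
import unittest

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence, in upper case.

    Raises ValueError if the sequence holds anything other than A, C, G or T.
    """
    sequence = sequence.upper()
    if set(sequence) - set("ACGT"):
        raise ValueError("unexpected base in sequence")
    return sequence.translate(_COMPLEMENT)[::-1]


class ReverseComplementTests(unittest.TestCase):
    """Tests written alongside the routine, one behaviour per test."""

    def test_simple_sequence(self):
        self.assertEqual(reverse_complement("AACG"), "CGTT")

    def test_empty_sequence(self):
        self.assertEqual(reverse_complement(""), "")

    def test_lower_case_is_accepted(self):
        self.assertEqual(reverse_complement("acgt"), "ACGT")

    def test_unexpected_base_is_rejected(self):
        with self.assertRaises(ValueError):
            reverse_complement("ACGN")


def run_regression():
    """Run all tests for this library and return True if every one passes."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReverseComplementTests)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Calling `run_regression()` after each change to the library gives a single pass-or-fail answer, which is what makes frequent re-testing cheap.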
These tests can be readily (re-)run ‘automatically’ to verify the library passes muster.
By repeatedly re-testing previously written code via the regression tests, you can catch any regressions (bugs you’ve accidentally introduced) early on. Ideally, you re-test immediately after writing any new code, so that you catch bugs you have accidentally introduced (well, you didn’t introduce them on purpose, did you?!) immediately after the change that introduced them, making it straightforward to locate the likely cause.
Obviously the quality of the result depends on how much effort you put into the tests. If you only test a routine once, with obvious output, the test still has some value but it’ll be a bit limited, to say the least.
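To make ‘more than one obvious test’ concrete, here is a sketch of a table of cases for a small parser. The routine `parse_fasta`, the `CASES` table and the helper `failing_cases` are hypothetical, and the parser is deliberately simplified; the interest is in the less obvious inputs listed after the first case.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data before the first header line")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# (label, input text, expected result): the first case is the obvious one,
# the rest are inputs that are easy to forget about.
CASES = [
    ("obvious case", ">a\nACGT\n", [("a", "ACGT")]),
    ("empty input", "", []),
    ("no trailing newline", ">a\nACGT", [("a", "ACGT")]),
    ("sequence split over lines", ">a\nAC\nGT\n", [("a", "ACGT")]),
    ("blank lines, Windows line endings", ">a\r\n\r\nACGT\r\n", [("a", "ACGT")]),
    ("header with no sequence", ">a\n>b\nAC\n", [("a", ""), ("b", "AC")]),
]


def failing_cases():
    """Return the labels of cases whose output differs from what is expected."""
    return [label for label, text, expected in CASES
            if parse_fasta(text) != expected]
```

Adding a row to the table is cheap, so each time real data turns up a surprise it can be recorded as one more case.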
This development approach takes time: it takes longer to create the initial versions of routines and of the project overall. In my anecdotal experience, though, it leads to fewer problems as the project grows in size and becomes more complex.
You’re left with a test suite that can be bundled with the final software and run when it is installed elsewhere.
Software development in bioinformatics
My point here isn’t really the outline of the approach, which will be obvious to most programmers. It’s that it has me wondering how much effort is made to test thoroughly that in-house data-processing code works as intended, using some formalised approach that tries, within reasonable limits, to cover all of the code and the main issues the code might have, rather than something more ad hoc.
(This is, of course, a good reason to use the better bioinformatics libraries! I’m leaving these aside as I’m interested in the code written at each site.)
To me, the test suites and documentation are as much the job as writing the code.
Most high-level code has intuitively obvious logic. The bugs in code like this, in my experience, often lie in assumptions about arguments to routines or assumptions about the actions of the routines themselves. By documenting accurately what routines expect and achieve, or deliver, and testing the lower-level routines thoroughly, you’re able to reduce these bugs.
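One way to pin down such assumptions is to state in the documentation what a routine expects and delivers, then have the code enforce the ‘expects’ part. The sketch below is a hypothetical Python example (`window_means` is not from any real library) of a lower-level routine written that way.

```python
def window_means(values, width):
    """
    Return the mean of each run of `width` consecutive values.

    Expects:  width is an integer >= 1.
    Delivers: a list of len(values) - width + 1 floats, or an empty list
              when there are fewer than `width` values.
    """
    # Enforce the documented expectation rather than assuming callers honour it.
    if not isinstance(width, int) or width < 1:
        raise ValueError("width must be an integer >= 1")
    return [sum(values[i:i + width]) / width
            for i in range(len(values) - width + 1)]
```

A caller that wrongly assumes a width of zero is allowed, or that short input raises an error, finds out at once from the description or from a failed test rather than from odd results further down the pipeline.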
This approach does involve a lot of work, but then the time taken to track down a stubborn bug in a large code base can be considerable. (It doesn’t help morale either.)
I probably should say, or confess, this is an approach I now use. I didn’t always.
 The development and testing approach depends on the nature of the project. For ‘pure’ algorithm development, the focus is on exploring variants of the algorithm to vary performance criteria rather than just correctness. (The literate and testing aspects still apply, but the focus isn’t limited to them.) Data-processing aspects–data pipelines, data analysis libraries and so on–fit well with what I describe here.
 Or arguments – pun intended here: let’s not argue over the meanings of these two words!
 This is a science forum. I have to painfully distinguish what is formally tested and what is really only relying on the prone-to-being-misleading ‘personal experience’ thing so some pedantic soul doesn’t leap on it…