A concern raised, which I have some sympathies with, is how to make computational science reproducible.
Modern science is largely grounded on the notion that findings can be repeated independently by others to verify them, or to extend them. In practice this can be easier said than done. You’d think that for computational sciences, like computational biology, it’d be cut and dried. It can be, but a lot of the time it isn’t.
I’d like to point to three things discouraging development of reproducible research in computational biology and suggest that in addition to “open” coding and a suitable legal framework, self-documenting output that can be used as input may help.
To start as I did, read John Timmer’s article Keeping computers from ending science’s reproducibility and the slides (PDF file) from Victoria Stodden’s talk Intellectual Property Issues in Publishing, Sharing and Blogging Science.
It’s claimed that the term “reproducible research” was proposed by Jon Claerbout (Stanford University) to encapsulate the idea that “the ultimate product of research is the paper along with the full computational environment used to produce the results in the paper such as the code, data, etc. necessary for reproduction of the results and building upon the research.” What both Timmer and Stodden write about follows this description and the abstract of the paper that the Wikipedia entry points to:
WaveLab is a library of Matlab routines for wavelet analysis, wavelet-packet analysis, cosine-packet analysis and matching pursuit. […]
WaveLab makes available, in one package, all the code to reproduce all the figures in our published wavelet articles. The interested reader can inspect the source code to see exactly what algorithms were used, how parameters were set in producing our figures, and can then modify the source to produce variations on our results.
WaveLab has been developed, in part, because of exhortations by Jon Claerbout of Stanford that computational scientists should engage in “really reproducible” research.
The full paper is available free on-line, via the abstract. (This is well worth reading if you haven’t already. It’s a light read if you skip lightly over the details of the product in sections 4 and 5 and focus on the discussion.)
In reading on, bear in mind that reproducible research involves more than just the code itself: it also needs the particular data set used as input, including all the parameters to the methods employed.
For the sake of simplicity, I’m only going to consider software that takes all its inputs at once, as a batch, rather than the interactive approach where one button click or change of a parameter results in an immediate “sub-problem” response.
Discouragement 1: Development for academic paper, not product
Many new methods, if not most, are not developed as a “product”. At least initially they are developed to generate an academic paper.
An upshot of this is that the implementation is intended only to go as far as demonstrating the utility of the new algorithm or concept.
These efforts are never really designed for others’ use as a product.
In the survey Stodden’s slides cite, the main reason given for not sharing software or data is “Time to document and clean up” (see the slide “Top Reasons Not to Share”: 77% for code and 54% for data, from a survey of American academics registered at a top Machine Learning conference).
If the work is only going to achieve an academic publication, cleaning up the code (or data) and documenting it won’t give any further reward.
There can be more incentive if the work is intended to be distributed, but in most cases this will be more a matter of pride than anything else: having distributed academic software (or data) is not easily measured as a career milestone (although it is easy to put on your CV).
In my experience very few academic computational efforts are properly documented.
(As an aside, at one point I considered doing technical writing–i.e. documentation–as a sideline to my consultancy work, partly because this issue both interested and annoyed me enough, and because I saw it as a problem that needed solving, which is what a business mind looks for. I developed plans for a new approach to documentation, which–who knows?–I may return to.)
Discouragement 2: Web-hosted development doesn’t encourage portability or documentation
Many bioinformatics methods when offered as products are offered as web-hosted services.
One “advantage” of a web-hosted service is that it really need only be implemented on the one computer. There’s no need for the software to be made portable or “tidy”. Pop it up on the server, the users only see the GUI and all the ugliness of the hacked code is invisible.
Provided you retain staff, documentation can be limited too (in some cases even absent: ugh).
Even if you offered such software as open-source, it’d probably be of little practical use to anyone else to extend or remodel. It’d take too long to figure out what it does. (This can have a side effect on the field that people are more inclined to re-invent than extend.)
While web-hosted services are good for end-users, easy on research budgets and fit the existing reward system better, they encourage practices that are not so good for reproducibility.
Discouragement 3: Formal testing of academic software is often poor or absent
This is one of my peeves, deserving of an article in its own right and a good beating!
Leaving the broader issues for another day, test data sets can be a useful component of reproducibility, to understand what a program does, what inputs are expected to achieve what outcomes and so on.
Important here is that they document the cases covered in testing, the cases where the method can be shown to be sound, and what might not have been considered that is open to further examination.
Related to this is that credit in publication is (mostly) for the concept and its demonstration, not any testing. It’s rare to see bioinformatics papers report formal testing.
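To illustrate how a test data set can document the cases covered, and those not yet considered, here is a minimal Python sketch. The function `count_motif`, the cases and the `NOT_COVERED` list are all invented for the illustration; a real method would need a far richer set.

```python
def count_motif(sequence, motif):
    """Toy stand-in for a real method: count non-overlapping exact matches."""
    return sequence.upper().count(motif.upper())


# Each case states what it documents, the inputs, and the expected outcome.
CASES = [
    ("single occurrence", "MKTAY", "KTA", 1),
    ("no occurrence", "MKTAY", "GGG", 0),
    ("two separate occurrences", "KTAYKTA", "KTA", 2),
    ("lower-case input is accepted", "mktay", "kta", 1),
    ("overlapping matches are not counted twice", "AAAA", "AA", 2),
]

# Stating what has NOT been examined is part of the documentation too.
NOT_COVERED = ["ambiguity codes in the sequence", "an empty motif"]

failures = [
    name
    for name, sequence, motif, expected in CASES
    if count_motif(sequence, motif) != expected
]
print("failed cases:", failures)
print("open to further examination:", NOT_COVERED)
```

A reader of such a table can see at a glance which inputs are expected to give which outcomes, without reading the implementation.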
Case example: Output documenting the input, with output able to be recycled as input
During my Ph.D. studies I wrote a “database program” to analyse motifs in protein sequences (or DNA or RNA).
At the time this was developed most programs were command-line or driven by command files of various sorts. (X Windows was around, but it was considered something of an extra to present a GUI.)
One feature of this software is relevant here: the input was specified as a short command file specifying the various parameters; this input file and all relevant internal values were reproduced in a special commented section in the output.
The program was designed to be able to take its own output files as input and (re-)generate a new output file.
This proved very valuable for several reasons:
- You already have a key element of a test suite mechanism.
- If you lose the input file, it doesn’t matter, it’s in the output anyway.
- The output is self-documenting (including its own filename, the date executed, etc.)
- You can re-run an analysis to test if a dataset has been altered.
And so on.
This provides reproducibility of analyses (as opposed to code).
It also means that it’s straightforward to design the system to re-run previous analyses, for batch processing, for others to reproduce the analysis on other datasets, or for others to use the system without having to learn the full set of options (using “pre-packaged” command sets for specific tasks).
Obviously the example uses an old coding style, but the concepts are what I want to convey here.
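In a more current style, the round trip of command file to output and back to input might be sketched as below. This is an illustration only, not the original program: the `#@` and `#i` markers, the parameter names and the toy motif count are all assumptions made up for the sketch.

```python
PARAM = "#@"  # hypothetical marker: an input parameter echoed into the output
INFO = "#i"   # hypothetical marker: run details that are not inputs


def read_params(text):
    """Read 'key = value' settings from a command file or a previous output."""
    lines = [ln.strip() for ln in text.splitlines()]
    echoed = [ln[len(PARAM):] for ln in lines if ln.startswith(PARAM)]
    # A previous output carries echoed parameters; a command file has none,
    # so there every non-blank, non-comment line is taken as a setting.
    source = echoed or [ln for ln in lines if ln and not ln.startswith("#")]
    params = {}
    for ln in source:
        key, _, value = ln.partition("=")
        params[key.strip()] = value.strip()
    return params


def analyse(params):
    """Placeholder analysis: count exact occurrences of a motif."""
    hits = params["sequence"].count(params["motif"])
    return [f"hits = {hits}"]


def run(text, run_label):
    """Produce output that begins with everything needed to regenerate it."""
    params = read_params(text)
    header = [f"{INFO} run = {run_label}"]
    header += [f"{PARAM} {key} = {value}" for key, value in sorted(params.items())]
    return "\n".join(header + analyse(params)) + "\n"


def without_info(text):
    """Drop run details so that two outputs can be compared."""
    return [ln for ln in text.splitlines() if not ln.startswith(INFO)]


command_file = "sequence = MKTAYIAKTAQ\nmotif = KTA\n"
first = run(command_file, "first run")
second = run(first, "re-run from the first output")  # the output is the input
print(first)
print("same analysis:", without_info(first) == without_info(second))
```

Comparing the two outputs with the run details set aside is also the bare bones of a regression test: if the comparison fails, either the code or the data has changed.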
This all seems obvious to the point that I worry this sounds horrible and condescending, but I’ve rarely seen academic software developed so that the output is truly self-documenting, which seems to me to be a key requirement for the reproducibility that Timmer and Stodden refer to.
One reason this isn’t done, I suspect, is simply the extra effort involved and that you really have to develop the software to work this way from the outset; generally speaking, it’s a pain to try to graft this sort of thing onto existing code later on.
As “food for thought”, I’d suggest:
(a) seriously consider not taking input direct from GUIs, but putting in place an intermediary step that gathers the GUI input into a “command” file (XML, plist, etc.), which is then fed to the analytical code, which becomes essentially a “batch-style” analytic back-end.
(You could use an internal data structure in lieu of an external format (i.e. a file), but an external format can be fed to the software directly and used to build up test cases, etc. for the analytical portion of the code independent of the GUI.)
(b) that output be able to be generated in a “plain text” form that includes all of the input parameters, so that it
(c) can be re-used as input with no further parameters needed (i.e. the output entirely self-documents the process conducted).
You’ll note that the latter implies an interface that is able to accept a text file as the sole “instructional” input. I prefer a command-line interface for this.
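As a rough sketch of how suggestions (a) to (c) might fit together: the GUI layer does nothing but write a command file, and the analytical back-end is a command-line program whose sole argument is that file or a previous output. JSON stands in here for the XML or plist formats mentioned above, and every name (`save_command_file`, `run_batch`, the `sequence` and `motif` settings, the file names) is an assumption invented for the illustration.

```python
import argparse
import json
import tempfile
from pathlib import Path


def save_command_file(gui_values, path):
    """All a GUI 'Run' button needs to do: write the settings to a file."""
    Path(path).write_text(json.dumps(gui_values, indent=2, sort_keys=True))


def run_batch(path):
    """Back-end: the one file is the sole instruction it receives."""
    loaded = json.loads(Path(path).read_text())
    # A previous output nests its settings under "input"; a command file
    # is the settings themselves. Either is accepted.
    settings = loaded.get("input", loaded)
    hits = settings["sequence"].count(settings["motif"])  # placeholder analysis
    return {"input": settings, "results": {"hits": hits}}


def main(argv=None):
    """Command-line entry point taking a single file argument."""
    parser = argparse.ArgumentParser(description="hypothetical batch back-end")
    parser.add_argument("file", help="a command file or a previous output")
    args = parser.parse_args(argv)
    output = run_batch(args.file)
    print(json.dumps(output, indent=2, sort_keys=True))
    return output


with tempfile.TemporaryDirectory() as tmp:
    command_path = Path(tmp) / "analysis.json"
    save_command_file({"sequence": "MKTAYIAKTAQ", "motif": "KTA"}, command_path)
    first = main([str(command_path)])

    # Feed the output back in, with no GUI and no other parameters.
    output_path = Path(tmp) / "analysis.out.json"
    output_path.write_text(json.dumps(first))
    second = main([str(output_path)])
```

Because the back-end only ever sees a file, test cases for it can be written as files too, independent of any GUI.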
3. Be aware that I’m talking about software for specific analyses. I’m not talking about large systems with dozens of parameter files and hundreds (or thousands) of internal options! Trying to cram all of that into output files would require a different approach unless you didn’t mind the size of the extra “baggage”.
(All graphics without source are from Wikipedia, with the exception of the simple program flowchart, which is the author’s work. I haven’t read the comments to Timmer’s ArsTechnica piece at this time.)