Reproducible research and computational biology

By Grant Jacobs 24/01/2010


A concern raised, and one I have some sympathy with, is how to make computational science reproducible.

Modern science is largely grounded on the notion that findings can be repeated independently by others to verify them, or to extend them. In practice this can be easier said than done. You’d think that for computational sciences, like computational biology, it’d be cut and dried. It can be, but a lot of the time it isn’t.



I’d like to point to three things discouraging development of reproducible research in computational biology and suggest that in addition to “open” coding and a suitable legal framework, self-documenting output that can be used as input may help.

To start as I did, read John Timmer's article Keeping computers from ending science’s reproducibility and the slides (PDF file) from Victoria Stodden’s talk Intellectual Property Issues in Publishing, Sharing and Blogging Science.

It’s claimed that the term “reproducible research” was proposed by Jon Claerbout (Stanford University) to encapsulate the idea that “the ultimate product of research is the paper along with the full computational environment used to produce the results in the paper such as the code, data, etc. necessary for reproduction of the results and building upon the research.” What both Timmer and Stodden write about follows this description and the abstract of the paper that the Wikipedia entry points to:

WaveLab is a library of Matlab routines for wavelet analysis, wavelet-packet analysis, cosine-packet analysis and matching pursuit. […]

WaveLab makes available, in one package, all the code to reproduce all the figures in our published wavelet articles. The interested reader can inspect the source code to see exactly what algorithms were used, how parameters were set in producing our figures, and can then modify the source to produce variations on our results.

WaveLab has been developed, in part, because of exhortations by Jon Claerbout of Stanford that computational scientists should engage in “really reproducible” research.

The full paper is available free on-line, via the abstract. (This is well worth reading if you haven’t already. It’s a light read if you skip over the details of the product in sections 4 and 5 and focus on the discussion.)

In reading on, bear in mind that reproducible research involves more than just the code itself: it also includes the particular data set used as input and all the parameters passed to the methods employed.

For the sake of simplicity, I’m only going to consider software that takes all its inputs as an all-at-one-time collective, rather than the interactive approach where one button click or change of a parameter results in an immediate “sub-problem” response.

Discouragement 1: Development for academic paper, not product

Many new methods, if not most, are not developed as a “product”. At least initially they are developed to generate an academic paper.

An upshot of this is that the implementation is intended only to go as far as demonstrating the utility of the new algorithm or concept.

These efforts are never really designed for others’ use as a product.

In the survey Stodden’s slides cite, the main reason given for not sharing software or data is “Time to document and clean up” (see the slide “Top Reasons Not to Share”: 77% for code and 54% for data, from a survey of American academics registered at a top machine-learning conference).

If the work is only going to achieve an academic publication, cleaning up the code (or data) and documenting it won’t give any further reward.

There can be more incentive if the work is intended to be distributed, but in most cases this will be more a matter of pride than anything else: that academic software (or data) was distributed is not easily measured as a career milestone (although it is easy to put on your CV).

In my experience very few academic computational efforts are properly documented.

(As an aside, at one point I considered doing technical writing, i.e. documentation, as a sideline to my consultancy work, partly because this issue both interested and annoyed me, and partly because I saw it as a problem that needed solving, which is what a business mind looks for. I developed plans for a new approach to documentation, which, who knows, I may return to.)

Discouragement 2: Web-hosted development doesn’t encourage portability or documentation

Many bioinformatics methods, when offered as products, are offered as web-hosted services.

One “advantage” of a web-hosted service is that it really need only be implemented on the one computer. There’s no need for the software to be made portable or “tidy”. Pop it up on the server: users only see the GUI, and all the ugliness of the hacked code is invisible.

Provided you retain staff, documentation can be limited too (in some cases even absent: ugh).

Even if you offered such software as open source, it’d probably be of little practical use to anyone else to extend or remodel. It’d take too long to figure out what it does. (This can have a side effect on the field: people become more inclined to re-invent than to extend.)

While web-hosted services are good for end-users, easy on research budgets and a better fit for the existing reward system, they encourage practices that are not so good for reproducibility.

Discouragement 3: Formal testing of academic software is often poor or absent

This is one of my peeves, deserving of an article in its own right and a good beating!

Manual punchcard device (IBM). Imagine the bugs you could create with this!

Leaving the broader issues for another day, test data sets can be a useful component of reproducibility, to understand what a program does, what inputs are expected to achieve what outcomes and so on.

Important here is that test data sets document the cases covered in testing, the cases where the method can be shown to be sound, and what might not have been considered and so remains open to further examination.
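As a rough, pytest-style sketch of what a shipped test case might look like (the tool name motif_search and the file paths are hypothetical stand-ins for whatever bundled example input and reference output a package provides):

    # Hypothetical regression test: the bundled example input, run through the
    # (equally hypothetical) motif_search tool, must reproduce the bundled
    # reference output. The test files double as documentation of a covered case.
    import subprocess
    from pathlib import Path

    def test_bundled_example(tmp_path):
        observed = tmp_path / "observed.out"
        subprocess.run(
            ["motif_search", "tests/example_input.txt", str(observed)],
            check=True,  # fail the test outright if the tool itself fails
        )
        expected = Path("tests/expected_output.txt").read_text()
        assert observed.read_text() == expected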

Related to this is that credit in publication is (mostly) for the concept and its demonstration, not for any testing. It’s rare to see bioinformatics papers report formal testing.

Case example: Output documenting the input, with output able to be recycled as input

During my Ph.D. studies I wrote a “database program” to analyse motifs in protein sequences (or DNA or RNA).

[Figure: simple program flowchart]

At the time this was developed, most programs were command-line tools or driven by command files of various sorts. (X Windows was around, but presenting a GUI was considered something of an extra.)

One feature of this software is relevant here: the input was given as a short command file specifying the various parameters, and this input file, along with all relevant internal values, was reproduced in a special commented section of the output.

The program was designed to be able to take its own output files as input and (re-)generate a new output file.

This proved very valuable for several reasons:

  • You already have a key element of a test suite mechanism.
  • If you lose the input file, it doesn’t matter: it’s in the output anyway.
  • The output is self-documenting (including its own filename, the date executed, etc.).
  • You can re-run an analysis to test if a dataset has been altered.

And so on.

This provides reproducibility of analyses (as opposed to code).

It also means that it’s straightforward to design the system to re-run previous analyses, for batch processing, for others to reproduce the analysis on other datasets, or for others to use the system without having to learn the full set of options (using “pre-packaged” command sets for specific tasks).

Obviously the example uses an old coding style, but the concepts are what I want to convey here.
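To give the concept a concrete, if modern, shape, here is a minimal sketch in Python. It is not the original program: the names and the simple "key = value" command-file format are invented for illustration. Parameters are echoed into the output as plain lines and results as comment lines, so an output file can be fed straight back in as input.

    # A minimal sketch of self-documenting output, with hypothetical names and
    # a made-up "key = value" command-file format. Parameters are echoed into
    # the output as ordinary lines and results as comments, so the output file
    # can itself be used as a command file.
    import sys
    from datetime import datetime

    COMMENT = "#"  # lines starting with this are ignored when read as input

    def read_commands(path):
        """Read parameters from a command file or from a previous output file."""
        params = {}
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if not line or line.startswith(COMMENT):
                    continue  # skip blank lines and the commented header/results
                key, _, value = line.partition("=")
                params[key.strip()] = value.strip()
        return params

    def write_output(path, params, results):
        """Write results preceded by every input parameter and some provenance."""
        with open(path, "w") as handle:
            handle.write(f"{COMMENT} generated {datetime.now().isoformat()} as {path}\n")
            for key, value in params.items():
                handle.write(f"{key} = {value}\n")     # re-usable as input
            for line in results:
                handle.write(f"{COMMENT} {line}\n")    # results are commented out

    if __name__ == "__main__":
        params = read_commands(sys.argv[1])   # a command file or an old output file
        results = ["stand-in for the real analysis of " + repr(params)]
        write_output(sys.argv[2], params, results)

Saved as, say, analyse.py, this would be run as "python analyse.py command.txt results.txt"; re-running it on an old output file reproduces the analysis from nothing but that file.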

Concluding thoughts

This all seems obvious, to the point that I worry it sounds horrible and condescending, but I’ve rarely seen academic software developed so that the output is truly self-documenting, which seems to me a key requirement for the reproducibility that Timmer and Stodden refer to.

One reason this isn’t done, I suspect, is simply the extra effort involved, and that you really have to develop the software to work this way from the outset; generally speaking, it’s a pain to try to graft this sort of thing onto existing code later on.

As “food for thought”, I’d suggest:

(a) seriously consider not taking input directly from GUIs, but putting in place an intermediary step that gathers the GUI input into a “command” file (XML, plist, etc.), which is then fed to the analytical code; the analytical code then becomes essentially a “batch-style” analytic back-end (a sketch follows after these suggestions).

(You could use an internal data structure in lieu of an external format or file, but an external format can be fed to the software directly and used to build up test cases, etc., for the analytical portion of the code independently of the GUI.)

(b) that output can be generated in a “plain text” form that includes all of the input parameters, so that it

(c) can be re-used as input with no further parameters needed (i.e. the output entirely self-documents the process conducted).

You’ll note that the latter implies an interface that is able to accept a text file as the sole “instructional” input. I prefer a command-line interface for this.
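Here is a minimal sketch of that arrangement, with hypothetical function names and JSON standing in for whichever command-file format (XML, plist, etc.) you prefer. The GUI’s only job is to write the command file; the same batch-style back-end then serves the GUI, the command line and a test suite.

    # Sketch of suggestions (a)-(c); the names are hypothetical.
    import json
    import sys

    def write_command_file(path, command):
        """The GUI's only job: gather widget values into a 'command' and save it."""
        with open(path, "w") as handle:
            json.dump(command, handle, indent=2)

    def run_analysis(command):
        """Batch-style back-end: everything it needs arrives in one structure."""
        return {"parameters": command, "result": "placeholder result"}

    if __name__ == "__main__":
        # Command-line use: python backend.py command.json > analysis.json
        with open(sys.argv[1]) as handle:
            command = json.load(handle)
        output = run_analysis(command)
        json.dump(output, sys.stdout, indent=2)  # the output embeds its own input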

Be aware that I’m talking about software for specific analyses. I’m not talking about large systems with dozens of parameter files and hundreds (or thousands) of internal options! Trying to cram all of that into output files would require a different approach, unless you didn’t mind the size of the extra “baggage”.


Other computational biology / bioinformatics posts at Code for life:

External (bioinformatics) specialists: best on the grant from the onset

Developing bioinformatics methods: by who and how

More on ’What is a computational biologist?’ (and related disciplines)

Retrospective: The mythology of bioinformatics (very popular article)

Bioinformatics — computing with biotechnology and molecular biology data (should really be read in conjunction with above, but few do!)

Computational biology: Natural history v. explanatory models

(All graphics without source are from Wikipedia, with the exception of the simple program flowchart, which is the author’s work. I haven’t read the comments to Timmer’s ArsTechnica piece at this time.)


18 Responses to “Reproducible research and computational biology”

  • Hi Grant,

    A few thoughts from someone whose bioinformatics tinkering extends to hacking together a few scripts to make my life easier. Following your numbering:

    1) Surely publishers and funders have a role here? The reason NCBI’s databases are so complete is that you have to include your sequences before you can publish the paper. Much the same thing is happening with phylogenetic/systematics journals, requiring machine-readable tree files and the like. Of course, the problem from the comp. biologist’s point of view is that this would make more work for the same reward!

    2) Web2.0 at least has the potential to help with this. Tools like Galaxy let you make re-usable pipelines that include whatever analysis software you want. Similarly, most of the web-based tools I use have some sort of automated interface, so you can use your favourite scripting language as a ‘glue language’. Perhaps not the complete self-documentation you want, but it at least gives someone else the opportunity to go through the same steps you have.

    Also, I’ve not yet played with Sweave, but it seems to offer some of the hope of self-documentation, at least at the actually-doing-the-tests end of the analysis.

    As I say, I’m not a bioinformaticist, much less a computational biologist (see, I’ve been reading!), so take my thoughts with as many grains of salt as you feel necessary :)

  • I guess there are quite good resources and tools available for reproducible computational research in the life-science domain. The only things that get in the way are the poor project-management skills of computational scientists and a lack of awareness. As David suggests with Galaxy and Sweave, tools like Taverna, myExperiment and KNIME are designed to help any computational biologist keep track of provenance. Similarly, markup languages such as SBML, CellML and the Predictive Model Markup Language (PMML) are facilitating reproducible research in the modelling and data-mining communities. On the software-development side, it is always advised to use best practice, for instance version control, OOD, UML, trackers, unit testing and a release system, but only if time and resources permit. Has anyone figured out how computational scientists perform in terms of reproducibility compared to their wet-lab counterparts?

  • I’ll reply properly later, as I’m extremely busy, but neither comment is quite on target about what I was writing.

    It might help to read the articles I started from, and especially the comments to Timmer’s ArsTechnica article, as I’m extending from these; you might then have a better feel for where this is coming from.

    David appears to be coming from something close to a “true” end-user’s point of view. I was writing mainly about the tools in and of themselves. It’s true there are also issues with in-house computational research being replicated elsewhere (i.e. non end-user stuff), and yet another level where end-users are using the tools but not developing them. They’re three different activities, with different issues. I’ll get back to this. (If I remember and find time!)

    Abhishek, I was writing about the nature of what’s finally delivered when someone presents a new method, rather than the development process. Not saying there are no issues with development itself, just that that’s not the key point I was after.

    For example, my reference to testing wasn’t about the testing process as it occurs during the development, which I tried to put aside as not being what I was after (“Leaving the broader issues for another day”), but that the test suites, etc., need to be included in the package that’s finally delivered so that others know what testing has been done if they want to reproduce or extend the method.

    I’m not disagreeing with the points you make, but even if you use the alphabet soup of tools you mention in the process of making a product, they won’t matter if the end product can’t be used in a reproducible way. What I’m writing about is not really whether you use all the “proper” tools or not, but whether the package delivered can be used reproducibly in the hands of others.

    That said, it’s true that there can be a considerable difference between academic concept development and commercial software practice. This isn’t necessarily “wrong”, but it’s something that people need to be aware of. (A thought experiment: consider that the case example I gave was written before most of the tools you mention existed, as were much earlier bioinformatics methods that are still with us (usually in revised form, though!).)

  • Hi Abhishek,

    Sorry if my reply was a bit over-worked; I’m rushed off my feet here. Your thoughts are good, but I do think there is a difference between “good development practice” and “reproducible research”. The former, by itself, doesn’t ensure the latter. I wanted to only write about a few bits that I thought were problematic and particular to computational biology / bioinformatics, and leave out those that are common to other areas (outside of computational biology) or which didn’t have issues, hence not bringing up some of the tools that you mention. (Otherwise I’d be at it all day…!)

  • Hi Grant,
    No worries. Here is what I think:
    I don’t think that the development process and the end product of “reproducible research” are two different things; they look slightly unrelated, but they depend on each other. The logic that better development practice doesn’t ensure reproducible research is certainly not universal. I will explain my point with another example which, I think, explains John Timmer’s rationale for putting reproducibility into computational code. Most mathematical modellers use MATLAB as their programming interface, and even when they provide the MATLAB scripts for their mathematical models they cannot ensure reproducibility. Why? Because a script does not provide any information about parameter settings, package dependencies, or the version and type of simulator. Now that we have standards such as SBML and CellML, which not only encode the models in an XML format but also capture other information associated with a model, such as simulation and annotation metadata, why are we hesitating to embrace them? In fact, by adopting SBML/CellML as their core standard and best practice, more than ten modelling packages allow their users to capture each aspect of the model-development process; as a final product, these XML files can be submitted to journals. Systems biology journals such as MSB, BMC, PCB and Bioinformatics are already asking authors to submit their models in SBML or CellML formats to ensure that simulation results are reproducible. As far as computational biology is concerned, sometimes I feel the whole discussion of reproducibility is overrated.

  • […] John Timmer wrote an excellent article called “Keeping computers from ending science’s reproducibility.” I’m quoted in it. Here’s an excellent follow up blog post by Grant Jacobs, “Reproducible Research and computational biology.” […]

  • Tiwari,

    I’ll try to get back to you properly in a couple of days. Sorry I haven’t replied sooner. I’m mainly talking about the “traditional” bioinformatics software that’s either used standalone or in a pipeline, e.g. seq. alignments, seq. searches, id. paralogs/orthologs, etc. Ditto for struct. biol. counterparts.

  • Just a loose thought: Talking about reproducibility in terms of absolutes probably isn’t helpful? Some might argue that in pragmatic terms reproducibility is a matter of degree: how much effort is involved. (I realise this moves it to a nebulous “how much is enough”. I’m not saying I agree with it, just tossing it in.)

    I think a little care is needed when pointing at one tool or another as if it’s going to be a magic bullet :-) It’s also a matter of how you use the tools. If you ignore whatever reproducibility features they have…

    With that in mind, I think no matter what tools are developed, the “real” problem is finding a (nice!) way to make it in academic researchers’ interest to do this.

    Sticking my neck out a bit, I’d quibble (in an academic fashion…) that standardisation is closer to what the MLs offer, or at least this is what I would have thought was their “point of difference” compared to other approaches. Consider my case example: I had reproducibility going. (Well, bits of it…) Sure, it was unique to that program! No two ways about that… But it was *there*. (In the late 1980s, too! Got to give myself some credit…) One-off implementations of reproducibility features, like mine, can achieve reproducibility with respect to (that implementation of) that product. Standardisation helps wider adoption, hence the standardised MLs.

    Finally, Tiwari mentions that some journals are calling for SBML/CellML for systems biology. Wouldn’t this be an acknowledgement that there is a problem, not that it’s not an issue? (I don’t work in systems biology, btw.) I can’t imagine that they’d see the need to ask for it otherwise. Crystallography and other areas have faced this too. Food for thought: a number of years ago, there was considerable fuss over bioinformatics journals not requiring that the software be made available at all at the time of publication.

    I can’t help but think that this is a general problem that continually re-surfaces. At some point in time people decide it’s too much of an issue and “demand” that the journals (or databases) insist on additional material to resolve the issue. And then a few years later it recurs, but about some other aspect. (Note I’ve shifted to a broader view than just software.) To me the underlying reason is probably the conflicts between the driving goals of the academic researchers doing the work and others’ interests in their work.

    (This has gotten long enough that perhaps I ought to resurrect it as a blog post… sigh.)
