Hurtling around the tubes have been links to John Cook’s excellent short article Software exoskeletons. Relayed by some with legions of followers, it will have been widely read.
John contrasts the programming styles and objectives of two camps: ’There’s a major divide between the way scientists and programmers view the software they write.’ I encourage readers to read his piece; it’s short and well worth the thoughts it raises.
I’d like to offer a few thoughts on what he’s written.
In particular, I’m left thinking that this comes back to talking with others and thinking ahead as you design a project, before you get into the meat of it.
If an aim is to spin off an end-user application, ideally this wants to be recognised and built into the project from the outset, so that the coding approach reflects that the code is intended to later become part of an application.
It’s a little unfair to over-analyse posts like this. They may have been written fairly quickly and without the spotlight that they gathered in mind. Blog posts are, after all, often explorations of nascent ideas.
I think his core point is excellent and the ‘lessons’–for the most part–very useful, and I like the snappy style they’re presented in. That said, I’m a little wary of the opening passage.
I’m going to tip his post upside-down and run through it backwards, from his final word, past the lesson in the middle of the piece, back to the opening dichotomy. It’s not just because I’ve got to be different somehow: there’s method in this madness!
My focus is on bioinformatics as it seems unwise to generalise about fields you’re not familiar with. In a similar way I’m going to leave aside the scientist/developers within pharmaceutical companies, who have a different environment to academia. It’s possible that my views differ from those of others because I have formal training in both computer science and biology.
The final word
Let me try to capture the message of his final sentences: Those whose outputs are research data and papers often don’t appreciate what is involved in converting (their) prototype or ‘one off’ software into end-user application software.
Here’s a loose-and-fast rule of thumb for non-programmers: the core bit of code that ‘crunches the numbers’ will be a small minority of the final code in an end-user application. This particular figure won’t be ‘right’, but consider it something to convey the general thinking: the analytical code might perhaps be 10-20% of the total code in the end-user application.
Most of the code will be dealing with all the things users can do (including coping with their erroneous decisions), feeding data to and from the core code, managing the input and output data files, and whatnot. Dealing with this stuff is not trivial; it’s a lot of work.
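To make that proportion concrete, here is a minimal sketch in Python. It is my own illustration, not anything from John’s post: the task (GC content of DNA sequences), the function names and the file layout are all assumptions chosen for the example. The number-crunching is one short function; nearly everything else exists to cope with files and users.

```python
import sys
from pathlib import Path

# --- Core analysis: roughly all that a one-off research script would hold ---
def gc_content(seq):
    """Fraction of bases in a non-empty DNA sequence that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

# --- Everything below is what an end-user application adds around the core ---
# Hypothetical choice for this sketch: accept A, C, G, T and the ambiguity code N.
VALID_BASES = set("ACGTN")

def read_fasta(path):
    """Parse a FASTA file into {name: sequence}, rejecting malformed input."""
    records = {}
    name = None
    for line_no, line in enumerate(Path(path).read_text().splitlines(), 1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError(f"line {line_no}: header has no sequence name")
            name = fields[0]
            if name in records:
                raise ValueError(f"line {line_no}: duplicate name '{name}'")
            records[name] = ""
        elif name is None:
            raise ValueError(f"line {line_no}: sequence data before any '>' header")
        else:
            seq = line.upper()
            bad = set(seq) - VALID_BASES
            if bad:
                raise ValueError(
                    f"line {line_no}: unexpected characters {sorted(bad)}")
            records[name] += seq
    return records

def run(in_path, out_path):
    """Check the inputs, call the core code, write a tab-separated report."""
    if not Path(in_path).is_file():
        raise FileNotFoundError(f"input file not found: {in_path}")
    records = read_fasta(in_path)
    if not records:
        raise ValueError("no sequences found in input")
    with open(out_path, "w") as out:
        out.write("name\tgc_content\n")
        for name, seq in records.items():
            if seq:
                out.write(f"{name}\t{gc_content(seq):.3f}\n")
            else:
                out.write(f"{name}\tNA\n")  # empty record: avoid dividing by zero

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: gc_report.py INPUT.fasta OUTPUT.tsv")
    try:
        run(sys.argv[1], sys.argv[2])
    except (OSError, ValueError) as err:
        sys.exit(f"error: {err}")
```

Even in this toy, the core is three lines and the handling of ‘stuff’ is the rest, and it still has no help text, logging, configuration or tests. A real application would likely push the ratio further.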
In addition, central to John’s message, code for in-house use tends to be adapted–and limited–to the projects it was applied to. End-users will want to apply the code to other datasets for other purposes. This means that the core code needs to be tested and (re)developed with that in mind.
In bullet form, paraphrased:
- Those whose aim is to write research papers consider coding done once the data is generated.
- Those writing application software ’give more thought to reproducibility, maintainability, and correctness.’
- Those writing application software ’need to understand that sometimes a program really only needs to run once, on one set of input, with expert supervision.’ [I suggest replacing ‘run once’ with ‘run on one kind of problem’ for software that is run repeatedly, but only on particular datasets.]
- Those writing software for one-off project use ’need to understand that prototype code may need a complete rewrite before it can be used in production.’
(You’ll notice I’ve replaced ‘scientist’ and ‘programmer’–I’m leaving that for the next section.)
These are generalisations, of course, with the usual problems generalisations have. In particular, while the last is spot on, I worry over aspects of the earlier ones, at least for some areas of bioinformatics. (Remember I’m not writing about science in general as he was, but bioinformatics.)
The key theme in John’s points is that how you code reflects your aims, and when you shift from one aim to another–from creating ‘just’ one research output to creating an end-user application–you need to be aware that you’re asking for different coding requirements.
A couple of quick thoughts (with bioinformatics in mind):
Thought should be given to whether the programmer/scientist is likely to reuse their code, especially as core tasks are often repeated in later projects. It’s useful to design the code so that it can be picked up again and, if necessary, tweaked with new features, and whatnot. In this sense coding is not ‘done’ once the current paper is sent off, and a more thoughtful and careful development process can (I would say, will) pay off in the long run despite initially taking longer.
Scientists/programmers need to give thought to reproducibility, maintainability and correctness. (I’ve touched on these in the past.) The first is just good science. The second relates in part to my previous paragraph. The last is, again, just good science.
Of course, a catch is that these require funding and time…
John opened by comparing scientists and programmers: ’There’s a major divide between the way scientists and programmers view the software they write.’ [My emphasis added.]
You’ll notice I’ve been replacing ‘scientist’ and ‘programmer’ with the output they intended to generate. It’s the reason I’ve worked through what he wrote backwards, so that I might end with the dichotomy offered at the outset.
To my reading, his key point is based on what the intended outputs are, and if you re-phrase the opening dichotomy in terms of outputs it better fits his message: There’s a major divide between the way those who aim to create research papers and those who aim to create end-user applications view the software they write.
It isn’t as snappy, and snappiness is part of what I think appealed to so many (me too!), but on closer inspection I think it’s nearer to what I believe his intended message is. (Among other things, having inserted clauses in place of nouns, the sentence wants to be re-structured to be clearer.)
I’m not trying to misrepresent him and I hope I’m not. I just think it’d be more productive to talk about it this way. (I confess I’m wary of cultural labels as they always seem to engender stereotypes and divisive thinking, something I have to admit I can’t stand.) Having acknowledged I’ve brutalised his message, I’m asking readers to let me get away with it!
Choosing the dividing line to be scientists and programmers requires readers to ’generalise away’ scientists that create applications, non-scientist programmers that generate one-off data, and other people that straddle the two types of outputs.
On one hand, there have for many years been commercial-style applications developed by bioinformatics scientists who have made a career of offering on-going revisions of particular packages or collections of tools. Some of these people are in academic settings, others commercial. (As you can see, a part of my take on this is because I’m not dividing scientist/non-scientist up based on whether they are in academia or not.)
By contrast, groups that code their own in-house software to get a dataset or paper out are often (even usually) in the mould John refers to.
Also, like all generalisations, there are a few exceptions that don’t fit the rule expressed in my generalisation… generalisations never really work when pressed…
Final thoughts (mine, that is)
I’ve previously written (see links below) that sometimes biologists approach computational biologists after they’ve designed their experiment or even, in cases, after they’ve generated their data. It can have the effect that the computational biologist is left trying to ‘rescue’ the project, trying to put back into it what should really have been included from the outset.
In similar fashion, I’d like to re-iterate my thought that, if an aim is to spin off application software, this ideally wants to be recognised at the outset, rather than tackled as an afterthought.
If this were done, the core code might then be developed in a more appropriate fashion and the transition to an end-user application proceed a little more gracefully.
1. I’m not going to argue over how many there are, who does what, etc., just say that there are some. For what it’s worth, I had originally hoped to do something similar myself based around prediction of protein functional sites, from an academic setting. Either that or working on gene regulation – it’s a long boring story…
2. I’m not saying without any hassles; that’d be unrealistic.
Other ruminations on bioinformatics on Code for life: