By Grant Jacobs 24/01/2016

Apparently researchers who use others’ data without collaborating with them are ‘research parasites’, according to an editorial from the New England Journal of Medicine. This has caused quite a fuss, with some expressing their opinion on the Twitter streams #dataparasites and #researchparasites. I think the editorial conflates data sharing with other issues.

Data sharing is an important part of science, but the editorial seems to conflate sharing samples and collaboration with data availability.

The editorial is focused on data from medical studies. This is appropriate for a medical journal, but much of the on-line fuss is centred around wider use of data sharing in genomics and science in general. I’ll naturally lean to computational molecular biology, as that’s my field.

The editors worry that others who use the data may not appreciate details:

However, many of us who have actually conducted clinical research, managed clinical studies and data collection and analysis, and curated data sets have concerns about the details. The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.

With all respect, in my experience it (also) often happens the other way around, with later researchers considering details the original authors did not.

Good computational biologists are able to leverage data in their area of biology. This uses a combination of a deep understanding of the (theoretical*) biology of the area, along with an understanding of what different algorithms or data analysis techniques might reveal. You need to understand how the tools really work, not just how to ‘use’ them.

People with deep cross-disciplinary backgrounds can, sometimes, see that more can be gleaned from a dataset, or that there are issues with the initial analysis offered in the original publication. Both good stuff for science.

The editors go on to wring hands over use of others’ data:

A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

Again with all respect, people who analyse data are not a new class of researcher. Researchers have been doing that for a long time, and have made important contributions this way, including correcting initial impressions.

The editors round out their argument by giving an example of what they consider good practice in their current issue. A catch is this involves sharing tissue samples, not just data:

This issue of the Journal offers a product of data sharing that is exactly the opposite. The new investigators arrived on the scene with their own ideas and worked symbiotically, rather than parasitically, with the investigators holding the data, moving the field forward in a way that neither group could have done on its own. […] To assess the clinical value of this potential biomarker, they needed a sufficiently large group of patients whose archived tissues could be used to assess biomarker expression and who had been treated in relatively homogeneous way.

(Emphasis mine.)

I’m sure the linked article is an excellent piece of work, and I mean to take nothing away from it in my remarks here. It sounds as if the teams worked together well to good effect.

But I think that the editorial is conflating two different things with data sharing. Firstly, they seem to consider sharing of samples to be sharing of data, when most researchers I know—at least from my field—would consider the two separate. Secondly, I feel they confuse when a collaboration is appropriate in arguing that it should ‘always’ be done.

Let’s focus on the second and on data sharing. ‘Holding’ samples has issues too, but I’m not well placed to comment on that.

Collaborations should only happen when there is a real basis for collaborating. There should be some active research element for both parties, or, at the very least, some substantial assistance with techniques or the like. The idea that data can ‘buy’ authorship is dodgy and open to abuse.

This could be viewed in the context of credit systems for authorship. If the data-holding team could only put down availability of data as their contribution, you’d hope that journals would consider that too weak grounds for full authorship. (It would be interesting to hear the NEJM editors’ comments on that.)

Publications should always acknowledge data and sample sources, no matter if a collaboration is undertaken or not. That goes without saying. If the other team has assisted in making the data or samples available, this should also be noted.

In the case of needing explanations of data, to me this reflects whether proper data documentation has been done. There’s a cliché that a project is not done until it’s documented. If too much explanation is needed, it may reflect that the materials & methods supplements for the research paper were insufficient.

I favour something akin to literate programming, where documentation is written during the design phase and elaborated as the project proceeds. Among other things, it means that you’re not left trying to tackle documentation after the fact, when you naturally want to be moving on to the next project.

The concern about being out-competed on follow-on work is an old theme, predating modern data sharing. It particularly comes up when generating the data has been a substantial effort.

One example is atomic-resolution macromolecular structures, which can be years of work. In the 1990s that community elected to set up an embargo process, with submission to a database (the Protein Data Bank) in exchange for an accession ID, which enabled publication. After the embargo time was up, the data was released by the database.

Some in genomics may not know that DNA sequence data was not always released in the 1980s and earlier. I can recall entering data by hand from figures in publications because they were the only source of that data. (Exacting and very time-consuming!) What turned that corner was journals not publishing unless a database accession for the data was supplied.

A key element in resolving data sharing issues in the past has been tying data submission to accession IDs, and accession IDs to publication.

That has worked for particular types of core data: molecular sequences and structures. How workable that is for data in general is an open question, but the pattern may be worth noting.

Submission of the data behind a project to the journal carries out a similar step, but without the separate storage in a database and the database accession ID. It’s a fraught problem. Where to store stuff and how.

A problem with embargoes is that they stymie progress: there is a lot to be said for immediate data release on publication. I’m not going to enter that argument fully here, but you do at some point have to let the data go. Better sooner than later.

If there is scope for collaboration it ought to come regardless. If there are competitors that close to the data, shouldn’t you be hiring them or initiating a collaboration with them? (Before developing the data, that is.) And if the key idea behind a new analysis is one you didn’t think of, all you’d have done by holding the data is hold up progress.

Fear of quick use of data might suggest the follow-on analysis isn’t that original, is fairly straightforward, or that others are better placed to do it. (I have sympathies with less well-funded groups in the latter respect.)

Science as an endeavour needs the data.

Speaking for my area, molecular biology, comparative analyses have contributed hugely. These analyses rely on the data being readily available. Similarly, integrating diverse datasets has made contributions.

Perhaps don’t be the one that stymies progress, and work with people rather than against them?


* One sticking point can be not fully appreciating the experimental techniques.

Philip Bourne is Associate Director for Data Science (ADDS) at the NIH. I should update what has been done there over the past year or so, but regrettably I haven’t time.

The image is of “Cymothoa exigua, or the tongue-eating louse, is a parasitic crustacean of the family Cymothoidae. The parasite enters fish (here a Sand steenbras, Lithognathus mormyrus) through the gills and then attaches itself to the fish’s tongue.” Author: Marco Vinci. Source: Wikipedia. Creative Commons Attribution-Share Alike 3.0 Unported.


Responses to “Data parasites eh?”

  • Below are a few related posts for anyone looking for further reading (or procrastination…)

    Before these, it ought to be pointed out that one of the editors (Jeffrey Drazen) has “endorsed a new proposal from the International Committee of Medical Journal editors calling for all researchers to make their data publicly available within six months” – as pointed out at Retraction Watch: (See: for the ICoMJE proposal.)

    Ronald Bailey: Stunning Rejection of Scientific Values of Transparency and Skepticism at New England Journal of Medicine

    Leonid Schneider: Research “parasitism” and authorship rights

    David Shaywitz: Data Parasites?

    Derek Lowe: Attack of the Research Parasites

    • I know the story—it’s pretty well-known!—but being down in the footnotes I thought I could leave it. (Better as a separate post, etc.) Cool photo though. Was looking for something that might fit, saw that and thought “yeah, that’ll do”.

  • More on topic, data reuse is a big issue, even in my area of science. At least one factor is that authors seem to want money before releasing published data for reuse. Often that money comes from public science funding. It is a bit analogous (though by no means a perfect analogy!) to royalties for public performances of published music (though without the creativity!)

    • Never heard of people asking for money for data release. That’d be as bad as asking for authorship.

      There is a pragmatic issue of archival of data: where, who funds it, on-going maintenance, etc. Big issue in the biomedical sciences – hence the NIH initiative I mentioned in the footnote.

  • It is actually way more complicated. I said “authors seem to want money before releasing published data for reuse”, but better to say publishers, not authors (though authors can still benefit economically and in other ways from the deal). This has nothing to do with archival of data. This might get a bit long-winded, but:

    In taxonomy, at least, there is currently a big push for what is called “open access publishing”. I’m not sure how closely this mirrors other areas of science? Anyway, OA is usually pitched as being for the public good, along the lines of “the public should not have to pay again to read the results of publicly funded research”, as if the public is currently somehow paying twice, but only once under OA. But it is all a load of complete bollocks, Grant!

    Interestingly, OA is being developed not in a straightforward way. It would make most sense for publishers to get funding directly from the public purse to fund OA (i.e. replace the profits that they would no longer get from subscriptions). Instead, authors pay publishers for OA, from the author’s funding. What difference does that make, you ask? Well, it opens up two rather alarming (to me) possibilities:

    (1) It could mean that authors with little or no funding can no longer afford to publish, however good their work may be; and

    (2) it opens up the possibility of a scenario along the following lines: consider researchers employed by institutions who claim “overheads” on grants awarded to the researchers. Suppose that the area of science involved is rather time consuming and not very profitable. If the researcher can ditch some grant money on OA fees, then they can say “OK, the funding has run out on this project now – time to move on to the next grant”.

    If OA funding went straight to the publisher (i.e. not via authors), then publishing would be truly free and open for all. In theory, papers would be published or not depending only on the results of standard peer review. A good paper would get published regardless of how well or poorly funded the author was. It would be naive to think that good researchers would inevitably be well funded.

    Will resume this post in a bit …

    • “Will resume this post in a bit …” Just a heads-up, I’m unlikely to take you up on this as I’m familiar with the OA story – read Stephen Curry on this, for example, and it’s a bit off-topic really (unless you want to focus on the data sharing aspects).

      (I replied to what you wrote, which didn’t seem to have any connection to OA.)

  • You might be wondering what OA has to do with data reuse? The two things are very closely linked. A truly OA paper has unrestricted data reuse. It isn’t just about being able to read the paper. Basically, you can do what the heck you like with it as long as the publisher has been sufficiently compensated. Publishers are currently blocking data reuse. This isn’t a big deal to the general public, but it is to other researchers and to big data aggregating initiatives.

    Anyway, I think that OA is a very bad idea, at least for taxonomy. Most papers in taxonomy are only of any interest to a few specialists (and to data aggregators, but for what purpose?). Publishers are very likely to set OA fees at overly “optimistic” estimates of likely readership. Higher impact and/or more “prestigious” journals are going to set OA fees higher. Authors who may want to ditch public research money (for the reasons explained above) may be tempted to opt for the more expensive journals (the economics of spending one’s own money is very different to that of spending someone else’s, e.g. public funding). The papers will get few or no more reads than they would have as pay to read papers, but more public funding intended for research will simply be diverted to publishers’ profits. Not good!

    Another complicating factor: what about already published stuff that is still under copyright? Chances are, we (the public) are still going to have to pay subscriptions to read these, on top of OA fees for new stuff.

    I find all these goings on (wheelings and dealings) to be rather unpalatable. However, taxonomy in particular is (apparently) having a hard time remaining economically viable in today’s world. I guess this might be the only way to prevent it from disappearing altogether? But pitching it in terms of “the public should not have to pay again to read the results of publicly funded research” just seems wrong to me.

  • I have only now read your reply at 1:58 pm. I hope I have made it at least somewhat clearer above what the connection is between OA and data reuse. I suppose that “data sharing” might not be quite the same as “data reuse”. Data reuse is about the freedom to use published data for any purpose. “Data sharing” could include unpublished data.

    Anyway, as I hope I have already made clear, economic factors are blocking data reuse. The public may end up paying for researchers to be able to reuse data published by other researchers. The public derives little or no benefit from this arrangement. It is therefore quite misleading to pitch OA in terms of the public good.

  • PS: So, basically, whatever I have said about OA applies pretty much equally to data reuse. One final comment: The pitch “the public should not have to pay again to read the results of publicly funded research” implies that the public is paying twice without OA, but only once with OA. That is true, but highly misleading! It doesn’t represent a saving! Paying $40 in each of two instalments is paying twice. Alternatively, paying $100 in one go is only paying once, but it isn’t a saving of money!