Apparently researchers who use others’ data without collaborating with them are ‘research parasites’, according to an editorial from the New England Journal of Medicine. This has caused quite a fuss, with some expressing their opinions on the Twitter streams #dataparasites and #researchparasites. I think the editorial conflates data sharing with other issues.
Data sharing is an important part of science, but the editorial seems to conflate sharing samples and collaboration with data availability.
The editorial is focused on data from medical studies. This is appropriate for a medical journal, but much of the on-line fuss is centred around wider use of data sharing in genomics and science in general. I’ll naturally lean to computational molecular biology, as that’s my field.
The editors worry that others who use the data may not appreciate details:
However, many of us who have actually conducted clinical research, managed clinical studies and data collection and analysis, and curated data sets have concerns about the details. The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.
With all respect, in my experience it (also) often happens the other way around, with later researchers considering details the original authors did not.
Good computational biologists are able to leverage data in their area of biology. This takes a combination of a deep understanding of the (theoretical*) biology of the area, along with an understanding of what different algorithms or data analysis techniques might reveal. You need to understand how the tools really work, not just how to ‘use’ them.
People with deep cross-disciplinary backgrounds can, sometimes, see that more can be gleaned from a dataset, or that there are issues with the initial analysis offered in the original publication. Both good stuff for science.
The editors go on to wring hands over use of others’ data:
A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”
Again with all respect, people who analyse data are not a new class of researcher. Researchers have been doing that for a long time, and have made important contributions this way, including correcting initial impressions.
The editors round out their argument by giving an example of what they consider good practice in their current issue. A catch is that this involves sharing tissue samples, not just data:
This issue of the Journal offers a product of data sharing that is exactly the opposite. The new investigators arrived on the scene with their own ideas and worked symbiotically, rather than parasitically, with the investigators holding the data, moving the field forward in a way that neither group could have done on its own. […] To assess the clinical value of this potential biomarker, they needed a sufficiently large group of patients whose archived tissues could be used to assess biomarker expression and who had been treated in relatively homogeneous way.
I’m sure the linked article is an excellent piece of work, and I mean to take nothing away from it in my remarks here. It sounds as if the teams worked together well to good effect.
But I think that the editorial is conflating two different things with data sharing. Firstly, they seem to consider sharing of samples to be sharing of data, when most researchers I know—at least from my field—would consider the two separate. Secondly, I feel they confuse when a collaboration is appropriate, in effect arguing that one should ‘always’ be formed.
Let’s focus on the second and on data sharing. ‘Holding’ samples has issues too, but I’m not well placed to comment on that.
Collaborations should only happen when there is a real basis for collaborating. There should be some active research element for both parties, or, at the very least, some substantial assistance with techniques or the like. The idea that data can ‘buy’ authorship is dodgy and open to abuse.
This could be viewed in the context of credit systems for authorship. If the data-holding team could only put down availability of data as their contribution, you’d hope that journals would consider that too weak grounds for full authorship. (It would be interesting to hear the NEJM editors’ comments on that.)
Publications should always acknowledge data and sample sources, no matter if a collaboration is undertaken or not. That goes without saying. If the other team has assisted in making the data or samples available, this should also be noted.
In the case of needing explanations of data, to me this reflects whether proper data documentation has been done. There’s a cliché that a project is not done until it’s documented. If too much explanation is needed, it may reflect that the materials & methods supplements for the research paper were insufficient.
I favour something akin to literate programming where documentation is written during the design phase, and elaborated as the project proceeds. Among other things, it means that you’re not left trying to tackle documentation after-the-fact when you naturally want to be moving on to the next project.
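As a minimal sketch of the documenting-as-you-go idea (my own illustration, not from the editorial; the function, dataset, and cutoff are hypothetical), the design decision can live right next to the code that implements it, so a later re-user of the data sees the reasoning without needing to ask the original authors:

```python
def filter_reads(reads, min_length=50):
    """Drop sequence reads shorter than min_length.

    Design note, recorded when the cutoff was chosen: in this
    hypothetical dataset, reads under 50 bases were dominated by
    adapter fragments, so they are excluded outright rather than
    trimmed. Revisit if the library preparation changes.
    """
    return [r for r in reads if len(r) >= min_length]


# A later analyst re-running the filter inherits the rationale
# along with the code, instead of an undocumented magic number.
reads = ["ACGT" * 20, "ACG"]  # one 80-base read, one 3-base fragment
kept = filter_reads(reads)
print(len(kept))  # the short fragment is dropped
```

The point is not the filter itself but that the explanation is written while the choice is being made, and elaborated as the project proceeds, rather than reconstructed after the fact.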
The concern of being out-competed on follow-on work is an old theme, predating modern data sharing. It particularly comes when generating the data has been a substantial effort.
One example is atomic-resolution macromolecular structures, which can take years of work to determine. In the 1990s that community elected to set up an embargo process, with submission to a database (the Protein Data Bank) in exchange for an accession ID, which enabled publication. Once the embargo period was up, the data was released by the database.
Some in genomics may not know that DNA sequence data was not always released in the 1980s and earlier. I can recall entering data by hand from figures in publications because they were the only source of that data. (Exacting and very time-consuming!) What turned that corner was journals not publishing unless a database accession for the data was supplied.
A key element in resolving past data sharing issues has been tying data submission to accession IDs, and accession IDs to publication.
That has worked for particular types of core data: molecular sequences and structures. How workable that is for data in general is an open question, but the pattern may be worth noting.
Submitting the data behind a project to the journal achieves a similar step, but without the separate storage in a database or a database accession ID. Where to store the data, and how, is a fraught problem.
A problem with embargoes is that they stymie progress: there is a lot to argue for immediate data release on publication. I’m not going to enter that argument fully here but you do at some point have to let the data go. Better sooner than later.
If there is scope for collaboration it ought to come regardless. If there are competitors that close to the data, shouldn’t you be hiring them or initiating a collaboration with them? (Before developing the data, that is.) And if the key idea behind a new analysis is one you didn’t think of, all you’d have done by holding the data is hold up progress.
Fear of quick use of data might suggest the follow-on analysis isn’t that original, is fairly straightforward, or that others are better placed to do it. (I have sympathies with less well-funded groups in the latter respect.)
Science as an endeavour needs the data.
Speaking for my area, in molecular biology comparative analyses have contributed hugely. These analyses rely on the data being readily available. Similarly, integrating diverse datasets has made contributions.
Perhaps don’t be the one that stymies progress, and work with people rather than against them?
* One sticking point can be not fully appreciating the experimental techniques.
The image is of “Cymothoa exigua, or the tongue-eating louse, is a parasitic crustacean of the family Cymothoidae. The parasite enters fish (here a Sand steenbras, Lithognathus mormyrus) through the gills and then attaches itself to the fish’s tongue.” Author: Marco Vinci. Source: Wikipedia. Creative Commons Attribution-Share Alike 3.0 Unported.