Sunday, 27 March 2016

Are data scientists just "research parasites"?

Although it passed me by at the time, the New England Journal of Medicine - a highly respected top-tier medical journal - featured an editorial on data sharing [1] in January. It was so bad that the International Society for Computational Biology (ISCB) felt the need to respond in the most recent issue of PLoS Computational Biology [2]. I’m glad they did, for the editorial was awful.

It starts quite well:

The aerial view of the concept of data sharing is beautiful. What could be better than having high-quality information carefully reexamined for the possibility that new nuggets of useful data are lying there, previously unseen? The potential for leveraging existing results for even more benefit pays appropriate increased tribute to the patients who put themselves at risk to generate the data. The moral imperative to honor their collective sacrifice is the trump card that takes this trick.

But then rapidly goes downhill:

However, many of us who have actually conducted clinical research, managed clinical studies and data collection and analysis, and curated data sets have concerns about the details. The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters. Special problems arise if data are to be combined from independent studies and considered comparable. How heterogeneous were the study populations? Were the eligibility criteria the same? Can it be assumed that the differences in study populations, data collection and analysis, and treatments, both protocol-specified and unspecified, can be ignored?

Many of us who have actually conducted data analysis would retort: if you have concerns about the details then you should be making those details clear. If choices are important, explain them! For sure, you cannot just blindly combine multiple datasets that have different biases, but what decent scientist would do that (at least without an explicit caveat regarding that assumption)?

Longo and Drazen seem to be implying that all data scientists are bad scientists. As I’ve said before, bioinformatics is just like bench science and should be treated as such. If you are making dodgy assumptions about data, you are doing it wrong. (Though people do make mistakes - the data collectors too.)

It gets worse:

A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

Apparently, some people might think I am a “research parasite” because I sometimes analyse other people’s (published) data without talking to them about it. I’m glad the ISCB called them out on this. Newsflash: science only makes progress by people trying to disprove what other researchers (and, ideally, themselves) have posited. Science is a shared endeavour. If someone uses your data to do something (good), good! If you don’t want that, embargo the data or delay publication. Then question your motives; if glory is what you seek, perhaps you’re in the wrong profession?

A researcher frightened of “stolen productivity” is perhaps a researcher struggling for ideas. (I’d love someone else to answer some of the questions I have kicking around so that I could move on to the next thing!) A researcher scared of someone trying “to disprove what the original investigators had posited” has bigger problems.

The rest of the editorial is not so bad, as it tells the tale of a fruitful collaboration between “new investigators” and “the investigators holding the data”. Of course, this is the ideal scenario, short of generating the data themselves. The fact that the authors felt the need to stress this - and the language used of “symbiosis” versus “parasitism” - demonstrates that Longo and Drazen are utterly clueless about the modus operandi of the disciplines they discredit. Whilst ideal, direct collaboration is not always feasible. Sometimes - when the original investigators are too attached to their pet hypothesis or conclusion - it is not desirable.

They end:

How would data sharing work best? We think it should happen symbiotically, not parasitically. Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant co-authorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested. What is learned may be beautiful even when seen from close up.

This sounds OK - and the described model may even be data sharing at its best - but the implication that anything short of this ideal is somehow inadequate is naive and unhelpful.

First, one person’s novel idea is another person’s obvious extension. And anyway, why should having one idea give you automatic rights to all obvious extensions?! Why should the rest of us trust the data gatherers to do a good job - especially if they exhibit attitudes towards data akin to those of these authors?

Second, identifying a potential collaborator does not guarantee collaboration. Ironically, the kind of paranoid narcissist that would use a term like “research parasite” is unlikely to be open to collaboration.

Third, citation is a form of co-authorship that acknowledges “the investigative group that accrued the data”. Wanting full co-authorship where additional intellectual input is not required is just greedy. (And a note to the narcissist: self-citations are generally seen as lower impact than citations by wholly independent groups.)

Longo and Drazen should stick to commenting on what they know, whatever that is, and leave data scientists to worry about how they conduct themselves. With this editorial, they have done everyone - not least themselves - a deep disservice.


  1. Longo DL, Drazen JM (2016) Data Sharing. N Engl J Med 374(3): 276–7. doi:10.1056/NEJMe1516564.

  2. Berger B, Gaasterland T, Lengauer T, Orengo C, Gaeta B, Markel S, et al. (2016) ISCB’s Initial Reaction to The New England Journal of Medicine Editorial on Data Sharing. PLoS Comput Biol 12(3): e1004816. doi:10.1371/journal.pcbi.1004816.
