Friday, 21 August 2015

Bioinformatics is just like bench science and should be treated as such

A bad workman blames his tools. A bad life scientist blames bioinformatics. OK, so that’s a little unfair but so is the level of criticism levelled at bioinformatics by people who should know better. If you are a bioinformatician, it is inevitable that you will run up against the question of whether you ever do “real” science.

If you are unlucky, it will be as blunt as that. At the end of a bioinformatics seminar earlier this year, someone actually asked (in what was meant to be a good-natured way): “Is bioinformatics real? I give the same data to two different bioinformaticians and get completely different answers!” Often, it is in the subtle form of: “are you going to validate that in the lab?” - as if validating it another way would not itself be valid.

If you are a bioinformatician and are asked a question like that, the correct answer is: it’s as real as [insert appropriate “wet” discipline of choice]. For bioinformatics is science and like all science it can be done well, or it can be done badly. It can generate meaningful results, or meaningless results.

If you want to get meaningful results, you have to treat it like a science, rather than a “black box” of magic. What do I mean by that? Here are my not-quite-buzzfeed-worthy, “8 shocking ways that bioinformatics is just like bench science”:

1. Experience and Training. You wouldn’t hand someone a wet lab protocol for, say, Southern blots, give them a key to your lab and say, “off you go” without first giving them some training and, preferably, the opportunity to learn from someone who already knows the procedure. If you do, expect bad results. Bioinformatics is no different. Just because a two year old can work a computer these days, that does not mean that bioinformatics is easy. A two year old can also press “Start” on a PCR machine.

2. Optimising your workflow. If you were doing a PCR, you wouldn’t just find a random paper you like that also did a PCR, copy the Mg2+ concentration etc. and then bung it in the PCR machine and run the default cycle. Likewise, you should not just stick your data into a bioinformatics program and automatically expect it to do the right thing. Just as to be a good molecular biologist, you need to be (or know) someone who knows a bit of chemistry to understand what’s going on, to be a good bioinformatician, you need to be (or know) someone who knows a bit of molecular biology (and chemistry! and sometimes physics) to understand (a) the data you are putting into a program/workflow, and (b) what the best way to process that data is. If you make the wrong assumptions about your data, you will get the wrong answer. (And if different people make different assumptions, they will probably get different answers.) Computers just do what they are told - don’t blame them if you tell them to do the wrong thing. (It is also important not to get “target fixated” on perfect optimisation; just like for bench science, the performance of your bioinformatics workflow only needs to be as good as your experiment/question demands.)

3. Planning. You wouldn’t start a bench experiment without planning it first. Just because bioinformatics is not time-dependent, that doesn’t mean that you shouldn’t plan your analysis before you start. Know what your final output is going to be and work backwards. Making decisions as you go along is a great way to make bad decisions. Sure, have a play to work out how things work but then go back and do it properly from beginning to end.

4. Lab notes. You wouldn’t just stick a tube labelled “20/8/15 mouse 3 PCR” in the freezer and expect to remember what it was and how it was made 3 months later. Instead, you would (hopefully!) keep a meticulous record of primers and reaction conditions etc. in your lab book. Bioinformatics needs the same record-keeping mentality. Program version numbers, dates and settings are important. Write them down. You will almost certainly end up running an analysis more than once, and it won’t always be the last run that you end up using. You do not want publications to be held up because you are having to re-run your bioinformatics just to work out what settings you settled on.
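One lightweight way to build this habit is to make your scripts write the “lab book” entry for you: log the exact command line, the tool’s reported version and a timestamp every time you run an analysis. A minimal sketch in Python - the wrapper name and log format are my own invention, and the assumption that a tool reports its version via `--version` will vary from program to program:

```python
import datetime
import shlex
import subprocess

def run_and_log(cmd, logfile="analysis_log.txt"):
    """Run a shell command, first recording when it ran, which tool
    (and version) was used, and the exact command line."""
    tool = shlex.split(cmd)[0]
    # Assumption: the tool prints its version with --version.
    # Many bioinformatics tools do, but not all - adapt as needed.
    try:
        result = subprocess.run([tool, "--version"],
                                capture_output=True, text=True)
        version = result.stdout.strip().splitlines()[0] if result.stdout else "unknown"
    except FileNotFoundError:
        version = "unknown"
    # Append one tab-separated record per run: timestamp, tool, command.
    with open(logfile, "a") as log:
        log.write(f"{datetime.datetime.now().isoformat()}\t"
                  f"{tool} ({version})\t{cmd}\n")
    return subprocess.run(cmd, shell=True)
```

Because the log is append-only, re-runs with tweaked settings accumulate as a history rather than overwriting each other - which is exactly what you need when working out, months later, which settings you settled on.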

5. Labelling. Even with a well-kept lab book, you wouldn’t store samples or extracted DNA in tubes labelled “tube 1”, “tube 2” etc. for every experiment. If someone rearranges your freezer - or there is an emergency freezer swap following power/equipment failure - you could quickly get muddled up. Likewise, don’t call your files things like “sequence.fasta”. You’re just asking to accidentally analyse the wrong data. Include multiple failsafes so that if you enter the wrong directory, for example, your file names won’t be found. (A pet hate of mine is bioinformatics software that outputs the same generic file names every time it is run, purely for the sake of lazy scripting.)
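Both halves of this advice - self-documenting names and failsafes - are easy to script. A small sketch, with hypothetical helper names and a made-up naming scheme (sample, processing step, date):

```python
import datetime
from pathlib import Path

def stamped_name(sample, step, ext="fasta"):
    """Build a self-documenting file name from the sample, the
    processing step and today's date, so that 'sequence.fasta' can
    never be confused with another run's output."""
    today = datetime.date.today().isoformat()
    return f"{sample}_{step}_{today}.{ext}"

def require_input(path):
    """Failsafe: stop immediately if the expected input is missing.
    If you are in the wrong directory, the analysis dies loudly
    instead of silently running on the wrong data."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"Expected input not found: {p}")
    return p

# e.g. stamped_name("mouse3", "trimmed") -> "mouse3_trimmed_<today>.fasta"
```

The point of `require_input()` is the failure mode: a generic name like “sequence.fasta” will be found in almost any analysis directory, whereas a specific name is only found where it belongs.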

6. Reproducibility. Bioinformatics is - or should be - extremely reproducible in a trivial way. If you put the same data into the same program with the same settings, you should get the same answer. In this sense, it should have the edge over bench work - you do not need to repeat experiments. Right? Well, not really. An analysis will be consistent in how it goes wrong, so repetition alone proves nothing: you cannot be sure that a bioinformatics tool is not getting confused by some subtle nuance or peculiarity of your data. Try with another tool that does the same job, or change a setting that should make no difference, and check that you get qualitatively the same answer. (Better still, try changing a setting that should make a predictable difference and make sure that it does.) Just as you can get a misleading lab result if you mislabel your tubes or add the wrong buffer, you can get a misleading bioinformatics result if you mislabel your data or use the wrong parameter settings.
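The trivial half of this check - did two runs actually produce the same output? - can be automated with checksums rather than eyeballing files. A minimal sketch (function names are my own):

```python
import hashlib

def file_hash(path):
    """Return the SHA-256 checksum of a file, read in chunks so that
    large output files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def same_result(path_a, path_b):
    """True if two runs produced byte-identical output files."""
    return file_hash(path_a) == file_hash(path_b)
```

Note that byte-identity is only the weakest form of the check: it catches a tool that is not deterministic when it should be, but the more valuable tests above - a different tool, or a setting that should (or should not) change the answer - still need qualitative comparison and judgement.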

7. Validation. Bioinformatics often receives a certain amount of flak that it is not “real” and everything needs to be validated. This is true, up to a point. The forgotten point is that nothing is real and everything needs to be validated. In the lab, you rarely actually measure or observe something directly - you are inferring reality from things you can measure (e.g. fluorescence) based on what you think you know about the system (e.g. what you’ve labelled) and certain assumptions (e.g. lack of off-target binding). You then have to perform additional experiments - and/or bioinformatic analysis of your data - to test that your assumptions appear to be good and that there are not alternative explanations for your observations. Bioinformatics is no different. NO different. You make assumptions and you make inferences based on observed outputs. These assumptions and inferences need to be tested. This might be by “validation” in the lab. It might be by independent analysis of other data. The only reason the former is more common is that one often needs to generate new data, which clearly bioinformatics cannot do. However, if the data already exists, there is no reason why bioinformatics cannot be used to validate other bioinformatics, or even bench experiments.

8. Limits. Regrettably, bioinformatics is not a magic wand. (Sadly, we are not bioinformagicians.) It cannot correct poor experimental design. It cannot overcome a lack of statistical power. Just like at the bench, if you design an experiment poorly, include confounding variables or overlook covariates, you might not be measuring what you think and/or you might not have any signal from your analysis. It is tempting to think that bioinformatics is more limited than bench science because we cannot collect our own data, but this would be wrong. Bench data is the raw material on which bioinformatics is performed. We can collect new data from other data - much of sequence analysis is doing just this. Of course, if our particular focus of interest has no data, we need to generate it. But the bench is no different: if you want to study the effect of a certain drug on a certain cell line and either (or both) does not exist, you have to generate that too.

So, what can we do about it? Bioinformaticians have to take some of the responsibility, largely because we are the ones that write software that perpetuates the myth that understanding parameters is not important. What do I mean by that? Well, often the documentation or “help” for bioinformatics tools is poorly written and poorly maintained - if it exists at all. When it does exist, it is usually written with expert users in mind. The novice is flooded with parameters and does not know which ones are important, or when. One solution is to write a series of protocols in the same vein as bench protocols, highlighting when one might want to change certain parameters - and which parameters are most important. (I am no saint in this department, sadly. If nothing else, this post has made me more determined to do better.)

The bottom line is quite simple, though:

Bioinformatics is science. Full stop. It is no better than other science. It is no worse than other science. People do it right. People do it wrong. However, if you are worried that it’s not real, the chances are that either you are doing it wrong, or you have deluded yourself about the “reality” of observations from bench science.


  1. Great post, very accurate! I really like the comparison between bench and computational science in particular.

    Oh and believe me - bench science has many more conflicting results than bioinformatics, at least the NGS side of bioinformatics. Does that mean bench science is not "real" to these people? Unbelievable.

  2. Nice thought-provoking post.

    From the perspective of a lab rat that needs to interact with bioinformaticians, I would emphasise one part of your second point:
    " be a good bioinformatician, you need to be (or know) someone who knows a bit of molecular biology (and chemistry! and sometimes physics) to understand (a) the data you are putting into a program/workflow, and (b) what the best way to process that data is...."

    It can be very scary to work with bioinformaticians who do not know or understand the chemistry (or physics or molecular biology) that underlies the data that they are given to play with, and who might even be unwilling to learn enough to understand the parameters that are needed for the processing of a specific dataset. It is just as scary to work with biochemists/biologists who are unable or unwilling to understand the basics of what happens to their data (and, consequently, what can go wrong) once they are handed over to a bioinformatician. If the interface between the bench science / data production and data analysis / bioinformatics becomes gappy, the result will very often be not very useful (to phrase it mildly).

    Re-reading your article makes me think that a next iteration of it could divide it into three different useful and interesting sections:
    - Your advice to bench scientists about how to interact / deal with bioinformaticians
    - Your advice to bioinformaticians about how to interact with the real world :-) and with bench scientists
    - Your advice to both about good practice in bioinformatics and data analysis

    1. Hi Achim! Yes, I totally agree. A bioinformatician who doesn't care about the data/results is as bad (or worse!) as a biologist who doesn't care about methods. Both bioinformaticians and bench scientists need to respect bioinformatics as serious science. I'll save the rant at bioinformaticians who completely ignore the real (biological) world for another day but your section requests might make some interesting follow up posts!

  3. I would like to suggest that you read my short piece in Human Mutation (34;1581-2). Bioinformaticists like to create scoring systems but don't fully test them, their distributions are not defined, and they often have no relationship to physical measurement (or units). This does a disservice to the field. I have found (as editor) that biologists will misuse these systems to make or disprove a point, when they have not been subjected to the same criteria that experimental data have.

    1. I think you will find that some biologists will cherry pick and misuse bad/untested/preliminary/unreproduced experimental data too. Bioinformatics is science. It is not immune to bad science. Bioinformaticians are people. Some of them do bad things or make mistakes. I don't really understand the point you are trying to make unless it is: biologists (of all kinds), please don't do bad science. (But it's not the fault of bioinformaticians - and certainly not the whole field of bioinformatics - if someone misuses their work. If a scoring system is crap or untested, don't use it!)

    2. Would you use a cell line that had not been tested or defined? I think you are just proving my point that failing to treat bioinformatics like bench science leads to problems. Editors and reviewers are at the front line of enforcing this.
