Retrospective–The mythology of bioinformatics

By Grant Jacobs 09/12/2009

The earlier period of eukaryote genome projects mythologised bioinformatics, viewing it as predominantly data management and IT-based methodology. Following from my previous post on this topic, I show my thoughts on this seven years ago.

Below I have republished an article I wrote in 2002, a year after I started my freelance consultancy (BioinfoTools), looking at how at that time I considered bioinformatics–my research field–was viewed in the earlier stage of the genome projects. I’d like to present my views at that time and later review how I see things now, seven years on. The text below is as published in 2002, very lightly edited. A small number of extra examples and a few asides have been added for clarity; I’ve indicated these by enclosing them in square brackets. I haven’t altered the dates or time scales, so you’ll have to bear in mind this was written seven years ago! You will also want to bear in mind that there was considerable hype about bioinformatics then, that genomes and bioinformatics were going to “solve everything” in a way that’s perhaps hard to imagine now. The references to ‘rumination’ arise because this article was published under a “Ruminations” column. I haven’t provided links to all the books and papers, as it’d simply take too long.

Bioinformatics is a much hyped, mythologised discipline. It isn’t that people are actively trying to mythologise it, nor that everyone has these views. It’s just that as many new people join in from technological, management and business backgrounds, the view of bioinformatics they appear to have differs from its original foundation in theoretical or ‘first principles’ biology. Even some biological researchers seem to share this view.

To see past these myths one needs to peer into bioinformatics’ past and view its progress since its beginnings. I’d like to explore how much of the current view of bioinformatics differs from the actual origins of the field and how this might affect bioinformatics in the immediate future. In some ways this rumination might be better titled “Is technology taking over in bioinformatics (at the expense of theoretical biology)”?

Having been trained by one of the early bioinformatics scientists (bioinformaticians?) and having studied in the area for around 10 years now, I believe I have a fairly useful perspective on where bioinformatics came from, roughly how it has progressed and, from this, a perspective of where it might be headed next.

Myth 1: Bioinformatics has arisen in the last 5-10 years

“Bioinformatics is a new science, which arose in the last 5 to 10 years or so”. We’ve all heard phrases along these lines in seminar and conference talk introductions and by groups trying to persuade their powers-that-be to fund them. True? If not, how is it likely to affect bioinformatics’ immediate future?

Let’s break this statement into two parts: the “newness” and the exact age. If you view science on the grand scale of hundreds or thousands of years, bioinformatics could hardly be anything but new. However, if you compare its age with the sciences it partners, particularly molecular biology–itself only a few decades old–you might be surprised to find it’s been around for a fair while. Bioinformatics “in science” (but not in name) began in the late 1960s – early 1970s, which we’ll look at in more detail below.

On a more personal note, I feel remarks purporting bioinformatics to have very recent origins must be rather galling to the pioneers of the field, most of whom have worked in bioinformatics all their careers and have since retired (at least officially!). I feel these remarks are a reflection of the hype over the last 5-10 years. We ought to give the early workers the credit they deserve and understand better where bioinformatics has come from so that we might better understand what we are doing.

Depending on where you draw the line, bioinformatics has been around since the late 1960s – early 1970s and certainly was established by the 1980s. I can hardly claim to know of all the early workers, but below I list enough to satisfy sceptics that the field was in fact active. Don’t feel offended if your favourite star is missing; this list would be very long if I included everyone! Early researchers of the late 1960s – early 1970s era include Margaret Dayhoff, Russell Doolittle, George Rose, Michael Levitt, and Andrew McLachlan (I must admit my bias here: Andrew was my Ph.D. supervisor). Somewhat later contributors from the 1970s onwards include Joe Felsenstein (phylogenetics), Michael Waterman (sequence analysis algorithm development), Temple Smith (sequence analysis methods), Cyrus Chothia (analysis of protein sequences and structures), Drs. Chou, Fasman and Robson (of secondary structure prediction fame), Walter Fitch (RNA structure prediction), V. I. Lim (organization of protein structures and secondary structure prediction), Needleman and Wunsch (sequence comparison and searching), Roger Staden (sequence analysis), David & Jane Richardson (protein structure). And on the list goes…

The first sequence database is surprisingly old. Margaret Dayhoff founded the Protein Identification Resource in the mid-1960s. This far-sighted move was the first of the sequence databases. Initially published in printed paper form [which I remember] as the famous blue-covered “Atlas”, it later evolved into the PIR sequence database. With her colleagues she detected early examples of conserved protein sequence motifs [e.g. the catalytic triad motif].

There are bioinformatics text books over ten years old whose bibliographies are testimony to the busy activity of bioinformatics research in the 1980s. Sitting on my shelves are well-worn copies of Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit (von Heijne, 1987) and Nucleic acid and protein sequence analysis: a practical approach (ed. Bishop & Rawlings, 1987). [For some reason I left out an old favourite: Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (Sankoff and Kruskal, 1983), possibly because I lost my copy shifting from Cambridge; I’ve since bought the re-printed edition whose cover is shown below. Yes, it’s plain white on black!]

The journal CABIOS (Computer Applications in the Biosciences), which has since become Bioinformatics, has been around since 1985. For those interested in finding early papers, much of the early literature in bioinformatics was published in JMB (Journal of Molecular Biology), NAR (Nucleic Acids Research) and to a lesser extent PNAS USA (Proceedings of the National Academy of Sciences USA), which are still strong publishers in this field. (Or at any rate, these are the journals I taught myself bioinformatics from.)

As an aside, recently I came across a paper in Science in 1986 by R Lewin entitled “The DNA databases are swamped”! Those of you who are familiar with the rapid increase in volume in these databases in the late 1990s will find this claim quite amusing. If only Dr. Lewin could have foreseen the genome projects!

So, it is clear that bioinformatics itself has not been “created” over the last 10 years. So what has been created over the past 10 years? I’d suggest:

  • the (largely) commercially-driven hype and demand for bioinformatics; bioinformatics itself has been around longer
  • possibly, establishment of the moniker “bioinformatics” as the name for the field (this may be, in part, related to the first item)
  • a marketing effect where the new part of the field (the technology and information science-driven components) has been hyped as the essence of the field

Myth 2: Bioinformatics is biology + computing

[Almost, but not quite; this point may be somewhat subtle to outsiders, but it’s important.]

A linguist might argue that bioinformatics is, strictly speaking, a field which restricts the application of informatics to biology. Informatics in its narrowest sense is about manipulation of data, without necessarily understanding the meaning of the data being manipulated; it’s about the computer methods used to manipulate the data, not the data themselves. Under this definition, bioinformatics would be primarily about bringing existing or modified variants of existing informatics methods to biological applications. These methods in many (most?) cases do not in themselves have much, or indeed any, knowledge of biological principles.

By contrast, early bioinformatics work was almost invariably founded on biological concepts from the outset. A biological issue was raised and then a technique to address that issue was presented. That is, theoretical biology was the foundation on which [early] bioinformatics was built. I fear this is being lost in the mass-data and technology-hype driven bioinformatics. It seems to me that unless companies and research groups are careful many will waste time and money “stamp collecting and cataloging”. Certainly the organized data is useful, but only if it is applied with biological principles.

I am not saying that new technology is not useful–it is–but that it is not the whole picture. I am also not saying that all bioinformatics has stopped having a biological focus–far from it!–but many new-comers [or biologists not familiar with the field at first hand] seem to see mainly the new hyped large-data and technologies issues. They appear not to see the less trendy theoretical biology-based work as being relevant to them.

One reason for this “first principles” orientation of early bioinformatics is that many of the early biologists were emigres from “hard” disciplines, in particular physics and chemistry, along with a few mathematicians. These folk were used to fields with underlying layers of principles upon which further work could be built. In addition, molecular biology was still trying to establish the basic understanding of itself, as it were, which encouraged these people to assist this venture.

Since then we have seen emerge a whole generation of molecular biologists who were, on the whole, comparatively ignorant of the theoretical (bio)chemical and (bio)physical underpinnings of molecular biology (compared to early researchers, who were frequently “true” chemists or physicists in their own right). This has led to modern biology until fairly recently being guilty to some extent of not deriving further underlying principles from the data generated from experimental studies. The large amount of data being generated at present would, I’d like to think, bring us back to the need to raise the level of “first principles” understanding of that data.

Some might argue that this early style of work is better labelled “computational biology”, a term I favour myself (biology, using computers as the tool as opposed to general informatics on data which happens to be biological). While perhaps elegant, pigeon-holing like this would only serve to further divorce what I believe ought to be the underlying layer of all bioinformatics ventures.

Theoretical biology is (or should be) the language of communication amongst the players in bioinformatics teams. And certainly at least the group leaders should have a theoretical biology foundation to ensure that real biological science results at the end of the day.

Skjomen bridge

I imagine bioinformatics as being a bridge, with biology on one side and computing, statistics, etc. (as the toolkits) on the other, upheld by theoretical biology acting as a bridge pile enabling communication between the two sides. Without the pile, the bridge has a rather long single span and is liable to collapse.

By omitting theoretical biology and retaining just biology + computing (or statistics or whatever it might be), one is asking for a superman-like leap with a single bound to be taken. That somehow the “tool component” (the computing, etc.) is supposed to magically wave its wand and suddenly solve previously difficult biological problems. I have serious trouble with this idea. The problems are biological problems after all: no amount of clever computing is going to remove the biology unless there are biological principles behind it. More than just databases and high-powered computers are needed.

Put another way, all fields have their underlying disciplines: I worry that with all the focus on technology, bioinformatics is in danger of forgetting that its underlying layer is theoretical biology. It doesn’t sound as trendy as bioinformatics, but it is essential. Chemists and physicists rarely ignore their theoretical components; they look to them for answers.

Most biologists, group leaders, managers and CEOs seem to swallow the hyped technology-based bioinformatics with ease. Gloop. Down it goes. I wonder how many see that theoretical biology lies under most (all?) good bioinformatics?

Part of the problem no doubt lies with the over-exercise of the point that there are large amounts of data and that this needs new methods. While this may be true (to an extent – other sciences have far worse data problems), this ought not to be done at the expense of discarding the underlying theoretical biology.

As this rumination has already gotten on beyond a reasonable length, I’ll explore in another article how bioinformatics workers, developers, teachers or users, can help themselves by attempting to explain their work in purely biological terms. If you can’t do this, you very likely do not know what you are doing and may well be doing something entirely inappropriate!

Really this is a better bridge to convey my analogy, in that it has a central pile; I just like the other photograph better… who said scientists do everything without emotion? 😉 Actually I do have a silly nitpick: as a photography buff, I can’t stand elements disappearing off the edge without “support” like the right-hand portion of the bridge does, it’s poor composition! So there!

Mid Hudson Bridge

© Grant Jacobs, BioinfoTools (2002-)

Other posts on bioinformatics on Code for life:

Bioinformatics — computing with biotechnology and molecular biology data

Computational biology: Natural history v. explanatory models

0 Responses to “Retrospective–The mythology of bioinformatics”

  • Some people around the internet seem to think that I was saying that “data organisation is a waste of time/money”, or the same of data collecting, or similar sentiments.

    Allow me to put this right: I wasn’t. (Past tense: this article was written seven years ago!)

    I wrote:

    It seems to me that unless companies and research groups are careful many will waste time and money “stamp collecting and cataloging”. Certainly the organized data is useful, but only if it is applied with biological principles.

    Here I was looking forward, from seven years ago, saying that it seemed to me that rushing in or overdoing it would result in waste, in more junk than I thought was healthy. (It needs to be remembered that there was huge hype in genomics then; these words lie in their historic context, as it were.) I was not saying that all “fishing trip” data would be useless. I wrote “organized data is useful” in part to make this distinction clear.

    At the time I was concerned that considerable effort would be spent collecting data that would ultimately prove to be of no or little use because not enough attention was being paid to the biological principles aspects that were needed to make good of it (and quality, although this wasn’t a focus of this article).

    This, of course, falls into a larger context of data relevance, context, completeness, accuracy, controls, replication, testing, etc; more than I have space (and time) to deal with here.

    Also, I did not write “data organisation”, but “organized data”.

    The former is the process of making order of data or the structure in which the data is organised (depending on the context). These are not about the data, or biology, itself but data management.

    The latter, which is what I wrote, was referring to the data itself, in particular large datasets that might loosely be thought of as being collected in the hope of later being useful. I wasn’t arguing that this kind of data was inherently of no use, but that what makes this data useful isn’t (for the most part) the data in and of itself, but that you later are able to apply biological principles to it to extract meaning from it (implying that this needed to be borne in mind from the outset).

    I was well aware at that time that there were challenges in managing data, but also that this fell more within bio-IT (a label I only stumbled onto later) than bioinformatics or computational biology. (My previous post also touches on this distinction; see the link in the opening paragraph.) An argument in what I wrote was that bioinformatics shouldn’t be perceived as being only or mainly “about” these bio-IT issues, which I considered were in danger of obscuring underlying knowledge that made the field yield useful results in its own right.

  • Judging by comments elsewhere on the internet, a small number of people appear to be reading this the wrong way. I suppose it’s inevitable that a few will. Below I offer a few pointers/reminders that might help.

    Before I get on to these, I encourage people to ask or discuss their thoughts if their opinions differ or they are confused as to what is being said.

    A few pointers/reminders:

    – This post was written in 2002. I’ve dug it up in the hope that this retrospective might be interesting and in the hope that I might later give a present-day take on my views that I can compare it with.

    – At that time genomics and bioinformatics were hugely hyped. I emphasise this because I suspect that a number of younger commenters don’t have a full appreciation of the fuss made over the genome and bioinformatics at that time. As I wrote “genomes and bioinformatics were going to “solve everything” in a way that’s perhaps hard to imagine now”.

    Readers need to be aware that it was written from that time. There are many things that are a reflection of that time that will not be entirely obvious if you weren’t working as a research biologist or bioinformatics / computational biology scientist then. One, for example, was a shortage of formally-trained bioinformatics workers with the result that people were recruited from other fields, many of whom didn’t have biology backgrounds, nor knew of the longer basis of the field. (I’ve nothing against these people, it’s just an observation.) There was (the beginnings of) a shift to large projects and commercial ventures. And so on. This article needs to be read in this context; it is “of its time”.

    – I was not slating bioinformatics! I was commenting on a perception of bioinformatics by those outside the field and those new to it that didn’t know its past and how this was impacting on how it was viewed. (A perception that I feel has in many ways lingered amongst those outside the field.)

    – Likewise, I was most definitely not “tarring” bioinformaticians or anyone else. (The people, not the field.) I’ve never been interested in “hitting on” people. I do care about how this field is perceived, especially if the perception of it isn’t helpful to the larger endeavour. It’s my field, after all. If people understood my route to entering this field, they’d understand this better, but I’ll avoid the trap of sounding off on my past… 😉

    – It may help some to also read the article I linked in the initial pre-amble paragraph (and in the footer), which takes a wider view on this topic and was written less than a year later for a small local scientific journal, NZ BioScience.

  • This comment about a failure to cite older literature in a research article might serve as an example (of sorts) of younger researchers missing the pre-1990’s science. (Decide for yourself, I’m just pointing it out!)

    There is an argument amongst some that there is no need to refer to the “original” method, but only the one in use in the paper at hand. I don’t want to enter that debate here.

  • i can see the effects of bioinformatics, especially at the psychological level, in my own lab. these days biology is becoming heavily dependent on the collective thinking of the science community as a whole, empowered by virtual networks, multidisciplinary technological advancements and corporate management. individualism is basically dying in modern science. new approaches, using state of the art technology and marketing strategies, are seriously taking over. in our own lab, the dominant topic is next generation sequencing and writing computer scripts. and indeed this is a dog eat dog world: you either join or you intellectually rot. our older professors, with lots of intellectual achievements from the past, are starting to feel insecure and are holding a defensive approach. the younger ones (including myself) try to embrace the new world order! technicians are nervous whether they can keep their jobs, and bioinformaticians, well, they are the hot shots today! but i think things will change: within a couple of years the scientific community will be drowned in its own mud of information. i think the future belongs to those who design and implement new realms of functional approaches, capable of controlling, defining, directing and manipulating this vast ocean of knowledge, and the majority of bioinformaticians will be the working ants of this realm.

  • Hi Christos,

    If you look in the ‘Other posts’ section immediately after the article and click on the first link you’ll see the first reference cited there is familiar. As you can see from this I was aware of your paper and certainly haven’t “missed” it. There are others of similar ilk, too. On that note, you’ll see that elsewhere in my blog I have referred to the roots of bioinformatics articles the journal Bioinformatics ran.

    As I (sort-of!) indicated in the introduction, I didn’t add references or links in this article as I was too short on time when I ‘ported’ it to the blog. (Unlike the article I just referred to, the ‘Mythology of Bioinformatics’ article was never prepared for formal publication, let alone published.)