The earlier period of eukaryote genome projects mythologised bioinformatics, viewing it as predominantly data management and IT-based methodology. Following on from my previous post on this topic, I present my thoughts on this from seven years ago.

Below I have republished an article I wrote in 2002, a year after I started my freelance consultancy (BioinfoTools), looking at how I considered bioinformatics–my research field–was viewed at that time, in the earlier stage of the genome projects. I’d like to present my views at that time and later review how I see things now, seven years on. The text below is as published in 2002, very lightly edited. A small number of extra examples and a few asides have been added for clarity; I’ve indicated these by enclosing them in square brackets. I haven’t altered the dates or time scales, so you’ll have to bear in mind this was written seven years ago! You will also want to bear in mind that there was considerable hype about bioinformatics then, that genomes and bioinformatics were going to “solve everything” in a way that’s perhaps hard to imagine now. The references to ‘rumination’ are there because this article was published under a “Ruminations” column. I haven’t provided links to all the books and papers, as it’d simply take too long.

Bioinformatics is a much hyped, mythologised discipline. It isn’t that people are actively trying to mythologise it, nor that everyone has these views. It’s just that as many new people join in from technological, management and business backgrounds, the view of bioinformatics they appear to have differs from its original foundation in theoretical or ‘first principles’ biology. Even some biological researchers seem to share this view.

To see past these myths one needs to peer into bioinformatics’ past and view its progress since its beginnings. I’d like to explore how much of the current view of bioinformatics differs from the actual origins of the field and how this might affect bioinformatics in the immediate future. In some ways this rumination might be better titled “Is technology taking over in bioinformatics (at the expense of theoretical biology)?”

Having been trained by one of the early bioinformatics scientists (bioinformaticians?) and having studied in the area for around 10 years now, I believe I have a fairly useful perspective on where bioinformatics came from, roughly how it has progressed and, from this, a perspective of where it might be headed next.

Myth 1: Bioinformatics has arisen in the last 5-10 years

“Bioinformatics is a new science, which arose in the last 5 to 10 years or so”. We’ve all heard phrases along these lines in seminar and conference talk introductions and from groups trying to persuade their powers-that-be to fund them. True? If not, how is it likely to affect bioinformatics’ immediate future?

Let’s break this statement into two parts: the “newness” and the exact age. If you view science on the grand scale of hundreds or thousands of years, bioinformatics could hardly be anything but new. However, if you compare its age with the sciences it partners, particularly molecular biology–itself only a few decades old–you might be surprised to find it’s been around for a fair while. Bioinformatics “in science” (but not in name) began in the late 1960s – early 1970s, which we’ll look at in more detail below.

On a more personal note, I feel remarks purporting bioinformatics to have very recent origins must be rather galling to the pioneers of the field, most of whom have worked in bioinformatics all their careers and have since retired (at least officially!). I feel these remarks are a reflection of the hype over the last 5-10 years. We ought to give the early workers the credit they deserve and understand better where bioinformatics has come from so that we might better understand what we are doing.

Depending on where you draw the line, bioinformatics has been around since the late 1960s – early 1970s and certainly was established by the 1980s. I can hardly claim to know of all the early workers, but below I list enough to satisfy sceptics that the field was in fact active. Don’t feel offended if your favourite star is missing; this list would be very long if I included everyone! Early researchers of the late 1960s – early 1970s era include Margaret Dayhoff, Russell Doolittle, George Rose, Michael Levitt, and Andrew McLachlan (I must admit my bias here: Andrew was my Ph.D. supervisor). Somewhat later contributors from the 1970s onwards include Joe Felsenstein (phylogenetics), Michael Waterman (sequence analysis algorithm development), Temple Smith (sequence analysis methods), Cyrus Chothia (analysis of protein sequences and structures), Drs. Chou, Fasman and Robson (of secondary structure prediction fame), Walter Fitch (RNA structure prediction), V. I. Lim (organization of protein structures and secondary structure prediction), Needleman and Wunsch (sequence comparison and searching), Roger Staden (sequence analysis), David & Jane Richardson (protein structure). And on the list goes…

The first sequence database is surprisingly old. Margaret Dayhoff founded the Protein Identification Resource in the mid-1960s. This far-sighted move created the first of the sequence databases. Initially it was published in printed paper form [which I remember] as the famous blue-covered “Atlas”; it later evolved into the PIR sequence database. With her colleagues she detected early examples of conserved protein sequence motifs [e.g. the catalytic triad motif].

There are bioinformatics textbooks over ten years old whose bibliographies are testimony to the busy activity of bioinformatics research in the 1980s. Sitting on my shelves are well-worn copies of Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit (von Heijne, 1987) and Nucleic acid and protein sequence analysis: a practical approach (ed. Bishop & Rawlings, 1987). [For some reason I left out an old favourite: Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (Sankoff and Kruskal, 1983), possibly because I lost my copy shifting from Cambridge; I've since bought the re-printed edition whose cover is shown below. Yes, it's plain white on black!]

The journal CABIOS (Computer Applications in the Biosciences), which has since become Bioinformatics, has been around since 1985. For those interested in finding early papers, much of the early literature in bioinformatics was published in JMB (Journal of Molecular Biology), NAR (Nucleic Acids Research) and to a lesser extent PNAS USA (Proceedings of the National Academy of Sciences USA), which are still strong publishers in this field. (Or at any rate, these are the journals I taught myself bioinformatics from.)

As an aside, I recently came across a 1986 paper in Science by R Lewin entitled “The DNA databases are swamped”! Those of you who are familiar with the rapid increase in volume in these databases in the late 1990s will find this claim quite amusing. If only Dr. Lewin could have foreseen the genome projects!

So, it is clear that bioinformatics itself has not been “created” over the last 10 years. So what has been created over the past 10 years? I’d suggest:

  • the (largely) commercially-driven hype and demand for bioinformatics; bioinformatics itself has been around longer
  • possibly, establishment of the moniker “bioinformatics” as the name for the field (this may be, in part, related to the first item)
  • a marketing effect where the new part of the field (the technology and information science-driven components) has been hyped as the essence of the field

Myth 2: Bioinformatics is biology + computing

[Almost, but not quite; this point may be somewhat subtle to outsiders, but it's important.]

A linguist might argue that bioinformatics is, strictly speaking, a field restricted to the application of informatics to biology. Informatics in its narrowest sense is about manipulation of data, without necessarily understanding the meaning of the data being manipulated; it’s about the computer methods used to manipulate the data, not the data themselves. Under this definition, bioinformatics would be primarily about bringing existing informatics methods, or modified variants of them, to biological applications. These methods in many (most?) cases do not in themselves have much, or indeed any, knowledge of biological principles.

By contrast, early bioinformatics work was almost invariably founded on biological concepts from the outset. A biological issue was raised and then a technique to address that issue was presented. That is, theoretical biology was the foundation on which [early] bioinformatics was built. I fear this is being lost in the mass-data and technology-hype driven bioinformatics. It seems to me that unless companies and research groups are careful many will waste time and money “stamp collecting and cataloging”. Certainly the organized data is useful, but only if it is applied with biological principles.

I am not saying that new technology is not useful–it is–but that it is not the whole picture. I am also not saying that all bioinformatics has stopped having a biological focus–far from it!–but many newcomers [or biologists not familiar with the field at first hand] seem to see mainly the new, hyped large-data and technology issues. They appear not to see the less trendy theoretical biology-based work as being relevant to them.

One reason for this “first principles” orientation of early bioinformatics is that many of the early biologists were emigres from “hard” disciplines, in particular physics and chemistry, along with a few mathematicians. These folk were used to fields with underlying layers of principles upon which further work could be built. In addition, molecular biology was still trying to establish the basic understanding of itself, as it were, which encouraged these people to assist this venture.

Since then we have seen emerge a whole generation of molecular biologists who were, on the whole, comparatively ignorant of the theoretical (bio)chemical and (bio)physical underpinnings of molecular biology (compared to early researchers, who were frequently “true” chemists or physicists in their own right). This has led to modern biology, until fairly recently, being guilty to some extent of not deriving further underlying principles from the data generated from experimental studies. The large amount of data being generated at present would, I’d like to think, bring us back to the need to raise the level of “first principles” understanding of that data.

Some might argue that this early style of work is better labelled “computational biology”, a term I favour myself (biology, using computers as the tool, as opposed to general informatics on data which happens to be biological). While perhaps elegant, pigeon-holing like this would only serve to further divorce what I believe ought to be the underlying layer of all bioinformatics ventures.

Theoretical biology is (or should be) the language of communication amongst the players in bioinformatics teams. And certainly at least the group leaders should have a theoretical biology foundation to ensure that real biological science results at the end of the day.

Skjomen bridge

I imagine bioinformatics as being a bridge, with biology on one side and computing, statistics, etc. (as the toolkits) on the other, upheld by theoretical biology acting as a bridge pile enabling communication between the two sides. Without the pile, the bridge has a rather long single span and is liable to collapse.

By omitting theoretical biology and retaining just biology + computing (or statistics or whatever it might be), one is asking for a Superman-like leap in a single bound to be taken: that somehow the “tool component” (the computing, etc.) is supposed to magically wave its wand and suddenly solve previously difficult biological problems. I have serious trouble with this idea. The problems are biological problems after all: no amount of clever computing is going to remove the biology unless there are biological principles behind it. More than just databases and high-powered computers are needed.

Put another way, all fields have their underlying disciplines: I worry that with all the focus on technology, bioinformatics is in danger of forgetting that its underlying layer is theoretical biology. It doesn’t sound as trendy as bioinformatics, but it is essential. Chemists and physicists rarely ignore their theoretical components; they look to them for answers.

Most biologists, group leaders, managers and CEOs seem to swallow the hyped technology-based bioinformatics with ease. Gloop. Down it goes. I wonder how many see that theoretical biology lies under most (all?) good bioinformatics?

Part of the problem no doubt lies with the over-exercise of the point that there are large amounts of data and that this needs new methods. While this may be true (to an extent – other sciences have far worse data problems), meeting that need ought not to come at the expense of discarding the underlying theoretical biology.

As this rumination has already gotten on beyond a reasonable length, I’ll explore in another article how bioinformatics workers, developers, teachers or users, can help themselves by attempting to explain their work in purely biological terms. If you can’t do this, you very likely do not know what you are doing and may well be doing something entirely inappropriate!

Really this is a better bridge to convey my analogy, in that it has a central pile; I just like the other photograph better… who said scientists do everything without emotion? ;-) Actually I do have a silly nitpick: as a photography buff, I can’t stand elements disappearing off the edge without “support” like the right-hand portion of the bridge does; it’s poor composition! So there!

Mid Hudson Bridge

© Grant Jacobs, BioinfoTools (2002-)

Other posts on bioinformatics on Code for life:

Bioinformatics — computing with biotechnology and molecular biology data

Computational biology: Natural history v. explanatory models