Choosing an algorithm – benchmarking bioinformatics

By Grant Jacobs 28/09/2010

You’ve been asked to run a bioinformatics analysis. How do you choose what algorithm to use?

My first suggestion would be to talk to experienced bioinformatics scientists or computational biologists. It’s a lot quicker, and it’ll save you making mistakes that the research literature assumes you know better not to. You’ll also avoid the trap of thinking that something is simple that in fact isn’t. In particular, you should talk to your local analytical experts well before collecting the data, ideally before drafting the grant application.

Structural alignment of thioredoxin (Source: Wikimedia Commons)

Putting that aside, let’s look at the case of you choosing an algorithm yourself. (Perhaps you are the local bioinformatics support person or you are a biologist with one fairly simple, specific thing to do.)

For most common tasks, there are several possible algorithms you could use. Which one is best for your specific problem?

The first thing you need to do is to accurately define the biological problem you want the analysis to resolve. Aside from ensuring that you are actually trying to solve a specific problem and not ‘just doing something’, you can’t possibly make an informed decision on what algorithm to use without this. (Don’t laugh at the first point, I’ve seen plenty of ‘bioinformatics’ figures in lesser papers that seem to just be filling space.)

(Related to this is a common problem I get as a consultant: people coming to me having tried to help by working out what analysis they want done. I have to politely encourage them to take a step backwards and explain to me the biological question that is to be addressed. I can’t recommend what is appropriate for them without this information either.)

You’ve done this and can identify the general class of algorithm you need. A DNA search. Phylogenetic trees. Multiple sequence alignment.

Evolution of nitrogen fixation (Source: Washington University in St. Louis.)

To your horror there are at least a dozen methods to choose from, and you’re not an expert in them. (Which is where the sensible thing to do is talk to someone who is, but let’s continue.)

You could just pick the most popular, but that’s lousy science. It invokes a familiar fallacy: popularity does not mean a method is effective or sensible for your niche, domain-specific problem. But people do it. They plump for ClustalW because it’s the multiple sequence alignment method they’ve heard of.

This, of course, is where knowing your specific problem comes in. You need to match that with the algorithms on offer. Which of them addresses your specific problem best?

At this point benchmarks should enter from stage left. (Or right, if your script calls for that.)

Standardised benchmarks allow you to compare one algorithm with another.

Benchmarks feel like one of those underrated things to me. They’re essential for progress in developing better bioinformatics algorithms: they provide a yardstick for developers to measure their efforts to improve the methods. They’re essential for users to choose the method best suited to their problem (as of when the benchmark was run).

Portion of a protein sequence alignment (Source: EMBL Sequence analysis course)

Good benchmarks are vital, yet I suspect that, for most people, a paper presenting a new benchmark doesn’t get the kudos it deserves.

They need to be carefully constructed: when appropriate and useful they are valuable, and when not they can be misleading.

Aniba and colleagues recently offered a discussion paper on benchmarks, mainly focusing on multiple protein sequence alignment, but with general thoughts that should be useful to newcomers to computational biology. I’m going to look at benchmarks from a users’ point of view here. (I may revisit the topic from a developers’ point of view at a later date.)

Their article includes a brief survey of benchmarks in a number of areas, e.g. multiple sequence alignment of proteins, RNA and DNA, protein structure prediction (in its different forms and objectives), protein docking (protein-protein interactions, or binding substrates), protein function annotation, gene expression analysis and image analysis. There are many other applications not mentioned (their list was not intended to be exhaustive): gene finding, calculating the pI of a protein, and so on.

I’m going to write generally in the hope that my loose thoughts on this topic might encourage others to think harder about benchmarks and their utility. I am not going to focus on protein alignment, despite it being an area of interest, as I fear it will be too narrow an interest to my readers. Speak up if this is not the case! Also, this isn’t intended to be a definitive account, more a starting point for thought. If you have further ideas, speak up too.

A critical point for someone using a benchmark is that they need to be aware of what the benchmarks measure. This might seem trite, and something of a catch-22, but it’s important. Broadly, there are two main sets of things to be aware of: how benchmarks are constructed – so that you might know what to look for in a good benchmark – and the particular data and tasks the benchmark measures (and how these relate to the problem you want to solve).

What to look for in a benchmark set

Aniba and colleagues lay out several criteria for benchmarks. I have paraphrased a key element of each of these below, adding my own thoughts where I have some to offer.

Users need to have some appreciation of these so that they might make properly-informed decisions based on the benchmarks. (Ideally, benchmarks would present these criteria for each task, so that users can make good use of them.)

  • Relevance What specific tasks does the benchmark reflect? How realistic are they? (e.g. are the data ‘real-world’ gold standards or artificial datasets?)
  • Solvability Aniba et al. describe this in terms of how easy or hard the cases are for algorithms to solve. I personally prefer a scaled series of cases, starting with some that are trivial and scaling up to cases that are (at the time) considered very challenging.
  • Scalability How comprehensive is the dataset the benchmark presents? Smaller datasets are more useful during software development but may not be sufficient to solidly compare different methods. (Another kind of scalability measures performance rather than quality, presenting different sizes of datasets and measuring execution variables such as time elapsed, storage used, and so on.)
  • Accessibility The benchmark needs to be easy to obtain and use.
  • Independence The benchmark should be independent of the software, so that the assessment is an honest and fair one.
  • Evolution Benchmarks should be revised to prevent over-optimisation to a particular set (the benchmark) and to reflect improvements in the field. Users will want to ensure that they are using the current version of the benchmark.
  • Data quality [My own addition.] Does the benchmark also measure tolerance of poor quality data? Some types of biological data are at best approximate; this can be a major consideration for methods dealing with these types of data.

The data and tasks measured

This is hard to generalise across different types of algorithms, but the key thing is that the benchmarks should identify the key challenges the algorithms need to meet and measure each of them.

Taking multiple sequence alignment of proteins as an example, here are a few of the challenges algorithms face:

  • Repetitive sequences
  • Regions with low similarity
  • Sequences of very different length
  • ‘End effects’ (alignment of the sequences at the N- and C-termini)
  • Choice of gap regime
  • Gain and loss of domains (with large gaps resulting)
  • Over-representation of subsets of the protein family
  • Inclusion of false positive matches or outliers (sequences that are not members of the protein family or that have strongly diverged)
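To make the idea of benchmark scoring concrete, here is a minimal Python sketch in the spirit of the sum-of-pairs scores used by protein alignment benchmarks: it counts the residue pairs a test alignment places in the same column and asks what fraction of the reference (benchmark) alignment’s pairs it recovers. The function names and the toy alignments are my own; real benchmarks use richer scoring.

```python
def aligned_pairs(alignment):
    """Set of residue pairs placed in the same column, each residue
    identified as (sequence index, residue index); gaps ('-') are skipped."""
    counts = [0] * len(alignment)  # residues seen so far per sequence
    pairs = set()
    for col in zip(*alignment):
        placed = []
        for seq_i, ch in enumerate(col):
            if ch != '-':
                placed.append((seq_i, counts[seq_i]))
                counts[seq_i] += 1
        # every pair of residues sharing this column counts as aligned
        for a in range(len(placed)):
            for b in range(a + 1, len(placed)):
                pairs.add((placed[a], placed[b]))
    return pairs

def sp_score(test, reference):
    """Fraction of the reference's aligned residue pairs the test recovers."""
    ref = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref) / len(ref)

reference = ["ACD", "A-D"]
print(sp_score(["ACD", "A-D"], reference))  # 1.0: identical alignment
print(sp_score(["ACD", "AD-"], reference))  # 0.5: the D is misplaced
```

A benchmark suite then runs this kind of comparison over many curated reference alignments, each chosen to probe one of the challenges listed above.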

Each type of data and class of algorithms has its own set of specific challenges like those I listed for protein sequence alignment above.

The user should note the nature of their specific problem, see how it aligns with the challenges for that class of bioinformatics algorithm, and from that and the benchmark results judge which algorithm might be the better choice.

Presenting and comparing benchmarks

One pair of measures recurs both in matching a user’s problem and in comparing algorithms: sensitivity and specificity. Good benchmark results should report both the sensitivity and the specificity of the algorithms. (Please note that medical sources define specificity differently.)

The word algorithm stems from the Latin version of Muhammad ibn Mūsā al-Khwārizmī’s name (Source: Wikimedia Commons.)

Let’s do this by alternating repetition: sensitivity, then specificity.

Sensitivity is the ‘recall rate’, the proportion of the correct answers from the dataset that are correctly identified by the method.

Specificity measures the proportion of the calls made that are correct.

Sensitivity is highest when there are few type II errors, i.e. when the method rarely falsely rejects a correct answer.

Specificity is highest when there are few type I errors, i.e. when the method rarely calls as correct a result that is in fact not.

Highly sensitive tests rarely miss a ‘true’ answer but may incorrectly call as true answers that are not. They’re sensitive, they ferret out more of the ‘true’ cases, but perhaps at the expense of dragging in some that aren’t.

Highly specific tests rarely falsely call as true an answer but may miss true answers. They’re specific, when they make a call, the call will more likely be the specific thing you’re looking for.

Formally, sensitivity = TP/(TP+FN), and specificity = TP/(TP+FP), where TP = true positive, FN = false negative, and FP = false positive.
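The two formulas reduce to one-liners. A minimal Python sketch, using the definitions above as written here (note again that medical usage defines specificity differently); the toy counts are my own:

```python
def sensitivity(tp, fn):
    """Recall: proportion of the true cases the method recovers."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Proportion of the calls made that are correct (as defined here)."""
    return tp / (tp + fp)

# A method that recovers 80 of 100 true cases while making 20 false calls:
print(sensitivity(tp=80, fn=20))  # 0.8
print(specificity(tp=80, fp=20))  # 0.8
```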

These require yes or no, true or false calls about the results.

Algorithms that yield a measure rather than a single state can be compared using schemes that account for degrees of accuracy for their fit against any statistical model they might use. In these cases you will see more sophisticated comparisons that are beyond the scope of this essay. The principles of specificity and sensitivity remain the same, although additional features can be explored.

When comparing different methods, one can plot the true and false positive rates against one another (a ROC curve) as one means to visualise the ability of the different methods. If you like, these illustrate the trade-off of one for the other.
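For a method that yields a score rather than a yes/no call, such a plot is built by sweeping a calling threshold and computing the two rates at each setting. A small Python sketch (the scores, labels and thresholds are invented for illustration):

```python
def tpr_fpr_curve(scores, labels, thresholds):
    """For each threshold, call positives where score >= threshold,
    then compute (false positive rate, true positive rate)."""
    pos = sum(labels)               # number of true cases
    neg = len(labels) - pos        # number of negative cases
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / neg, tp / pos))
    return points

# Toy data: two true cases scored high, two negatives scored low.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
print(tpr_fpr_curve(scores, labels, [0.0, 0.5, 1.0]))
# [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
```

A permissive threshold calls everything (both rates 1.0); a strict one calls nothing (both rates 0.0); the points in between trace out the trade-off.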

Regardless of the scoring scheme used to compare algorithms, the significance of the comparison matters. Are the differences between the methods significant? (In the statistical sense of significant.)
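One simple way to probe this, when two methods have been scored on the same benchmark cases, is a paired bootstrap: resample the cases with replacement and see how consistently one method beats the other. A rough Python sketch (this is one of several reasonable approaches, and the function name and parameters are my own):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=1):
    """Fraction of resampled benchmarks on which method A outscores
    method B; values near 1.0 (or 0.0) suggest a consistent difference,
    values near 0.5 suggest the difference may be noise."""
    random.seed(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]  # resample cases
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Method A clearly better on every case: A wins every resample.
print(paired_bootstrap([0.9] * 20, [0.1] * 20))  # 1.0
```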


Aniba, M., Poch, O., & Thompson, J. (2010). Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Research. DOI: 10.1093/nar/gkq625

Other articles in Code for life:

Pick the Nobel winners and win

Career paths, redux – the academic research career is the exception

GoPubMed — PubMed browsing using ontologies

The roots of bioinformatics

Autism – looking for parent-of-origin effects
