A ‘lite’ review of an attempt to improve motif screening of genomes using additional data. Considering local minima of histone acetylation in a context of high histone acetylation may be one way forward.

A number of years ago, at after-talk drinks for the biotechnology sector, I found myself trying to convince a CEO that the better way to improve a predictive method is often not to keep refining the one measure, which tends to run into the ‘law of diminishing returns’, but to find a complementary, independent measure and use the original measure and the new measure together to improve the prediction success.

Recent years have seen similar approaches taken to improve scanning genomes for transcription factor binding sites.

Transcription factors regulate the rate of expression of genes by binding short DNA sequences in the genome, in the neighbourhood of the gene they regulate.

Conceptually at least, the epigenetic state of a region of the genome controls whether a gene is accessible for use, and hence whether it can be the target of a transcription factor.*

Array of nucleosomes. The ‘tails’ that are chemically modified are indicated with dashed lines. (Original from Luger lab, sourced from Biomedical Beat.)

DNA in eukaryotes is packaged using histone proteins into arrays of nucleosomes. Chemical modification of the histone proteins correlates with the nature of the packing of the DNA, to form either condensed or open chromatin. Active genes are associated with open chromatin.

Open chromatin (portions of the genome associated with genes that are actively being used) is characterised by a number of features, such as the acetylation of histone H4.

Some experimental approaches identify the binding sites of some transcription factors (more-or-less) directly, in particular a technique known as ChIP-seq (ChIP = Chromatin Immuno-Precipitation). As the name implies, this involves antibody-based isolation of the transcription factor together with the DNA it is bound to; sequencing that DNA gives the short DNA sequences the transcription factor was binding.

Ribbon model (cartoon) of Zn2Cys6 zinc finger dimer bound to DNA (Source: Wikimedia Commons.)

Besides giving the (general) location of where the TF binds, a large collection of DNA fragments bound by one transcription factor lets you determine the common DNA sequence within those fragments (a step which itself has issues that I am not going to cover here). This common DNA sequence, the binding site for the transcription factor, is usually represented as a profile or motif.
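For readers who like to see the idea in code, here is a minimal sketch of turning a set of already-aligned binding sites into a profile. The sequences are made up for illustration, and the pseudocount, the uniform background of 0.25 and the function name `build_pwm` are my own assumptions, not anything taken from the paper. Real motif finding also has to locate and align the sites within the fragments, which this skips entirely.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return a log-odds profile (one dict per position) from aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases do not give log(0).
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        # Log-odds of the observed frequency against a uniform background.
        pwm.append({base: math.log2((count / total) / background)
                    for base, count in counts.items()})
    return pwm

# Toy, invented binding sites (assumed to be aligned already).
sites = ["TGACGTCA", "TGACGTCA", "TGACGCAA", "TTACGTCA"]
pwm = build_pwm(sites)

# The highest-scoring base at each position gives a consensus sequence.
consensus = "".join(max(column, key=column.get) for column in pwm)
print(consensus)
```

Positions where every site agrees get a strongly positive score for the shared base, while positions with more variation carry less weight when the profile is later used to score a candidate sequence.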

While a very useful approach, this will not work (well) for all transcription factors. As with all experimental approaches, some transcription factors will not ‘play along’.

Prior to ChIP-seq being common, a number of approaches started by first identifying what DNA sequences the protein (the transcription factor) bound, then searching the genome for matches to the profile or motif identified.

The first step in these methods is a variation on the concept of creating a mixture of DNA sequences (a sequence library) and screening it to see which DNA sequences the protein binds.

One problem with the second step, screening a genome sequence with a profile (or motif), is that while a match to the profile might be found, in practice that short DNA sequence might not be accessible to the protein in the cell, which is to say that profile (motif) screening tends to give many false positive matches. (There are further issues: I’m trying to keep this simple and brief.)
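To make the screening step concrete, here is a small sketch of sliding a profile along a sequence and reporting every window that scores above a threshold. The profile (a crude +1/-1 scoring for the made-up motif TGAC), the sequence and the thresholds are all invented for illustration. Note that nothing in the code knows whether a match is accessible in the cell, which is exactly the problem described above.

```python
def score_window(pwm, window):
    """Sum the per-position scores of one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))

def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(pwm, sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits

# Toy profile: +1 for the preferred base at each position, -1 otherwise.
pwm = [{base: (1.0 if base == preferred else -1.0) for base in "ACGT"}
       for preferred in "TGAC"]

sequence = "AATGACCTTGATTGACG"

strict = scan(sequence, pwm, threshold=4.0)   # exact matches only
relaxed = scan(sequence, pwm, threshold=2.0)  # one mismatch tolerated

print([start for start, _ in strict])   # [2, 12]
print([start for start, _ in relaxed])  # [2, 8, 12]
```

Loosening the threshold picks up the near-match at position 8 as well. On a genome-sized sequence, short motifs like this match a great many places by chance alone.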

RCC1 bound to nucleosome. (Source: Penn State Science.)

In similar fashion to my conversation with the CEO, Stephen Ramsey and colleagues reasoned that they could improve predictions of transcription factor binding by using additional independent measures such as the epigenetic state of the region of the genome.

Broadly, the concept is that if a matching motif was found within chromatin that was open in the cell type the transcription factor was active in, the matched site was more likely to be one bound by the transcription factor.
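In its crudest form, the concept amounts to keeping only those motif matches that fall inside regions of open chromatin. The sketch below does just that, with invented coordinates and hypothetical function names. It is a hard yes/no filter of my own devising to illustrate the idea, not the scoring model the paper uses.

```python
def inside_open_region(start, width, open_regions):
    """True if the match [start, start + width) lies wholly inside an open region."""
    end = start + width
    return any(lo <= start and end <= hi for lo, hi in open_regions)

def filter_hits(hits, width, open_regions):
    """Keep only the motif matches located in open chromatin."""
    return [hit for hit in hits if inside_open_region(hit, width, open_regions)]

# Invented example: start positions of motif matches and open-chromatin intervals.
motif_hits = [120, 480, 950, 1510]
open_regions = [(100, 300), (900, 1000)]

kept = filter_hits(motif_hits, width=8, open_regions=open_regions)
print(kept)  # [120, 950]
```

The open regions would need to come from the cell type (and conditions) in which the transcription factor is active, since chromatin state differs between cell types.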

(In a similar way, (so-called) gene expression data have been used to aid motif detection and prediction. Conceptually you can think of using measures of the chromatin state as a more direct way of assessing the setting a transcription factor works in: after all, gene expression data are mature RNA levels, which are several steps after the transcription factor binding DNA.)

Shmulevich’s team are not the first to combine other measures, including chromatin state-related ones, with transcription factor binding site motif identification. (For those interested, the paper gives a round-up of most of these attempts.)

They wanted to test which further measures, used alongside a motif for the transcription factor, might improve the prediction of binding sites.

To test this, five transcription factor motifs were considered.** Four were used to train their prediction model, using the area under a sensitivity versus false positive rate (FPR) curve as the target of the training, with the fifth used to test the trained model. This was repeated with each TF in turn being the one left out for testing.
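The leave-one-out scheme is easy to sketch in code. What follows is my own toy version, not the authors' model: the data are random numbers, the transcription factor names are placeholders, and the ‘model’ is a single weight on the extra measure chosen from a small grid. What it does share with the description above is the training target (the area under the sensitivity versus FPR curve) and the pattern of training on four factors and testing on the fifth.

```python
import random

def auc(scores, labels):
    """Area under the sensitivity-vs-FPR curve, computed as the chance that
    a bound site outscores an unbound one (ties count as half)."""
    bound = [s for s, is_bound in zip(scores, labels) if is_bound]
    unbound = [s for s, is_bound in zip(scores, labels) if not is_bound]
    wins = sum((b > u) + 0.5 * (b == u) for b in bound for u in unbound)
    return wins / (len(bound) * len(unbound))

def combined_auc(sites, weight):
    """AUC when each site is scored as motif score + weight * extra measure."""
    scores = [motif + weight * extra for motif, extra, _ in sites]
    labels = [label for _, _, label in sites]
    return auc(scores, labels)

def make_toy_sites(rng, n=30):
    """Synthetic (motif score, extra measure, bound?) triples; NOT real data."""
    sites = []
    for label in (1, 0):
        for _ in range(n):
            sites.append((rng.gauss(label, 1.0), rng.gauss(label, 1.0), label))
    return sites

rng = random.Random(0)
# Placeholder names; the values are random numbers, not measurements.
data = {name: make_toy_sites(rng)
        for name in ["TF_A", "TF_B", "TF_C", "TF_D", "TF_E"]}
weights = [0.0, 0.5, 1.0, 1.5, 2.0]  # assumed grid of candidate weights

held_out_auc = {}
for test_tf in data:
    # Train on the other four factors: pick the weight with the best total AUC.
    training = [sites for name, sites in data.items() if name != test_tf]
    best = max(weights, key=lambda w: sum(combined_auc(s, w) for s in training))
    # Test on the factor that was left out.
    held_out_auc[test_tf] = combined_auc(data[test_tf], best)

for name, value in held_out_auc.items():
    print(name, round(value, 3))
```

The point of holding one factor out is that the reported accuracy is for a transcription factor the model never saw during training, which is the situation you are in when predicting binding sites for a new factor.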

Taken from Fig 1A of Ramsey et al. (See reference for source details.)

The bottom line is that local minima of histone acetylation in the context of high histone acetylation usefully lifted the prediction accuracy of the motif screens.

For those wanting a little more detail without having to dig it out of the paper (which is very readable, by the way), here is a little more on what they did.

The five additional measures they tested were histone acetylation ChIP-seq scores, the presence of local minima of histone acetylation (derived from the HAc ChIP-seq scores), GC content (regions of high GC content have been indicated to be associated with gene regulatory regions), vertebrate sequence conservation (if the local region is conserved in other species, it is likely to have a functional role), and the presence of nucleosomes (TF binding is associated with regions of low nucleosome occupancy). They also considered whether the environmental state of the cell mattered (it does, as you might expect).
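To give a feel for what ‘a local minimum of histone acetylation in a context of high histone acetylation’ might look like computationally, here is a rough sketch. The definition used (a point lower than its neighbours on both sides, with a high average in the flanks and a large enough drop) and the parameters `flank`, `high` and `drop` are my own assumptions for illustration; the paper derives its measure from the HAc ChIP-seq scores in its own way.

```python
def acetylation_dips(signal, flank=2, high=5.0, drop=0.5):
    """Return indices that are local minima sitting within high acetylation.

    A position counts if it is lower than every value in both flanks, the
    mean of the flanks is at least `high`, and the value at the position is
    at most `drop` times that flanking mean.
    """
    dips = []
    for i in range(flank, len(signal) - flank):
        left = signal[i - flank:i]
        right = signal[i + 1:i + 1 + flank]
        context = (sum(left) + sum(right)) / (2 * flank)
        is_minimum = signal[i] < min(left) and signal[i] < min(right)
        if is_minimum and context >= high and signal[i] <= drop * context:
            dips.append(i)
    return dips

# Invented acetylation scores along a stretch of genome (one value per bin).
signal = [1, 1, 2, 8, 9, 3, 9, 8, 2, 1, 0, 1, 1]

print(acetylation_dips(signal))  # [5]
```

Bin 5 is a dip flanked by high acetylation, so it is reported. Bin 10 is also a local minimum, but its surroundings have low acetylation, so it is not.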

Testing pairs of these measures, with one being the motif and the other being the additional measure, they found that local minima of histone acetylation and the motif gave the largest improvement in prediction accuracy. The HAc ChIP-seq alone offered some improvement, but to varying degrees from one transcription factor to another. GC content and sequence conservation showed small but consistent improvements in prediction accuracy, and so were statistically significant.

Three-feature models did not show a statistically significant improvement over the two-feature models; this may be a reflection of the limited binding site data used.


* Clearly, this is meant as a sweeping generalisation.

** The five transcription factors were: ATF3, C/EBPδ, IRF1, NFκB/p50 and NFκB/p65.


Ramsey, S., Knijnenburg, T., Kennedy, K., Zak, D., Gilchrist, M., Gold, E., Johnson, C., Lampano, A., Litvak, V., Navarro, G., Stolyar, T., Aderem, A., & Shmulevich, I. (2010). Genome-wide histone acetylation data improve prediction of mammalian transcription factor binding sites. Bioinformatics, 26(17), 2071-2075. DOI: 10.1093/bioinformatics/btq405

Other articles on Code for life:

Epigenetics and 3-D gene structure

Loops to tie a knot in proteins?

Finding platypus venom

Choosing an algorithm – benchmarking bioinformatics

The roots of bioinformatics