British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 017: Is your overlap significant?

Imagine you have a really exciting new dataset. An RNA-seq or ChIP-seq dataset, with lots of genes that change between your conditions. And let's say you have a hunch that this has some similarities to some other treatment. Well, one thing to do would be to correlate the samples and see how similar they are. Another thing would be to cluster multiple samples together and see which are the most similar. But perhaps you don't expect similar changes at the global scale; instead you expect a subset of genes to be changed in both. They overlap by an extent greater than predicted by chance. Certainly you should try visualizing the overlap with a size-dependent Venn diagram (for how to make one in Python, link here). 

To assess the significance of the enrichment, you'll need a statistical test. First, find the number of genes that change in both experiments (eg increased expression genes or decreased histone PTM). Let's say that you have 2000 genes in experiment 1 and 4000 genes in experiment 2 and the overlap is 700. 

Now what you need is the hypergeometric test in R. Load up R and we can use the awesome phyper function: 

$ phyper(q, m, n, k, lower.tail = FALSE, log.p = FALSE) 

where q=(size of overlap-1), m=number of differential genes in experiment 1, n=(total number of genes on platform-m), and k=number of differential genes in experiment 2. 

The potentially tricky last part is the total number of genes on the platform (minus the differential genes of experiment 1). On paper this should be easy, but different studies might be working on the same organism, but with different sets of the genes. 

Some studies might expand the transcriptome to include more non-coding RNAs or use a technology that is missing some genes (eg microarray). Therefore, it is worth checking what genes are in common between the two studies before setting the platform number; this includes the number that are differential in your two experiments. For example, in experiment 1, there might be 2000 differential genes out of 30000, but in experiment 2, 4000 differential genes out of 25000 genes. All 4000 differential genes of experiment 2 might be included in platform (background) of experiment 1. However, only 1800 differential genes of experiment 1 might be included in experiment 2. We want to compare like for like. So we should use the overlap of the genes in the background (the platform) and adjust our numbers. Now we can enter the numbers into the test: 

$ phyper(700-1, 1800, 25000-1800, 4000, lower.tail = FALSE, log.p = FALSE) 
[1] 1.726242e-132

If we had not accounted for the different studies using different platforms (backgrounds), we would actually have underestimated the significance of the overlap (but it would have been inaccurate): 

$ phyper(700-1, 2000, 25000-2000, 4000, lower.tail = FALSE, log.p = FALSE) 
[1] 6.921924e-106

The earlier overlap is a highly significant result, but what about another overlap? From the same species, with a platform of 25000 genes, an overlap of 195 between two experiments (1500 and 3000 differential in each). 

$ phyper(195-1, 1500, 25000-1500, 3000, lower.tail = FALSE, log.p = FALSE)
[1] 0.1180457

As you can see, we cannot reject the null hypothesis* so we should move on to the next part of the project. 

With this simple test, you can quickly look for enrichment using statistics. 

Many thanks to the Biostars discussion on this topic where I took code and inspiration from. 


*Footnote: This has been updated as originally the wording was an inaccurate description of how to consider statistical tests. Many thanks to Prof Lior Pachter who pointed this out on Twitter. 

Compbio 018: Getting more significance out of R

Compbio 016: Practical Python for biologists - Moving from Python2 to Python3