British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 018: Getting more significance out of R

While doing research, I have run statistical tests that yielded incredibly small p-values. This is not unusual when working with large datasets in genomics. In fact, when using R, I often get p-values so small that R no longer prints the actual value. The result is reported as "p-value < 2.2e-16". Here is such an example:

$ a <- 1:10
$ b <- 100:110
$ t.test(a,b)
    Welch Two Sample t-test
data:  a and b
t = -71.87, df = 18.998, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -102.39768  -96.60232
sample estimates:
mean of x mean of y
      5.5     105.0 

As you can see, from the printed output we only know that the p-value is smaller than 2.2e-16, so comparing any two p-values below this threshold is impossible. One way to get an estimate of a p-value below 2.2e-16 is to append "$p.value" to your statistical test in R:

$ t.test(a,b)$p.value
[1] 1.311457e-24
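
As far as I can tell, the "< 2.2e-16" label is a display cutoff rather than a limit on what R stores: 2.2e-16 is the machine epsilon (.Machine$double.eps), and the p-value held inside the test result is an ordinary double. The sketch below illustrates this with format.pval(), the base R helper for formatting p-values. The eps value of 1e-300 is an arbitrary small cutoff I picked for illustration, and the variable names are my own.

```r
a <- 1:10
b <- 100:110

res <- t.test(a, b)   # the result is a list-like object of class "htest"

# The stored p-value is a plain double, not a truncated value
p <- res$p.value

# Machine epsilon for doubles: the default cutoff used when formatting p-values
eps <- .Machine$double.eps

# With the default cutoff, tiny p-values are shown as "< <eps>"
shown_default <- format.pval(p)

# Lowering the cutoff (arbitrary value, for illustration) shows the number itself
shown_full <- format.pval(p, eps = 1e-300)

print(shown_default)
print(shown_full)
```

So "$p.value" does not compute anything new; it only reads the number that the print method chose not to display.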

Here we have the p-value as a specific number, rather than the very broad category of "< 2.2e-16". If we run a different test, between sample a and another sample, the result is again highly significant (the printed output would show p-value < 2.2e-16):

$ c <- 60:70
$ t.test(a,c)$p.value
[1] 2.159892e-20

However, by examining the estimated p-value above, we see that it is larger than the p-value from the a vs b test, which is what we would expect from the tested values.
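
When p-values are this small, I find them easier to compare on a -log10 scale, where a larger number means a more extreme result. This is a minimal sketch using the same samples as above; the names p_ab, p_ac and neg_log10 are just ones I made up for the example.

```r
a <- 1:10
b <- 100:110
c <- 60:70   # same name as in the post; R still finds the c() function when it is called

p_ab <- t.test(a, b)$p.value
p_ac <- t.test(a, c)$p.value

# -log10(p): the larger the value, the smaller the p-value
neg_log10 <- -log10(c(a_vs_b = p_ab, a_vs_c = p_ac))
print(round(neg_log10, 2))

# TRUE if a vs b is the more extreme of the two comparisons
print(neg_log10[["a_vs_b"]] > neg_log10[["a_vs_c"]])
```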

Whether you should do this is a different matter. Forums are filled with complex discussions on the loss of accuracy in calculating numbers < 2.2e-16. But if you "need" to present numbers below 2.2e-16 from statistical tests in R, this is a simple solution to that problem. And if the solution is problematic, I look forward to an active discussion about why and what else we can do. 
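
One alternative that might sidestep some of the accuracy worries is to compute the p-value on the log scale directly from the test statistic. The sketch below assumes the default two-sided Welch t-test, where the p-value is twice the lower tail of the t distribution, and uses the log.p argument of pt(). The variable names are my own, and I have not checked how this behaves for other tests.

```r
a <- 1:10
b <- 100:110

res <- t.test(a, b)

# Pull the t statistic and the Welch degrees of freedom out of the result
t_stat <- unname(res$statistic)
df     <- unname(res$parameter)

# Two-sided p-value on the natural-log scale: log(2 * P(T <= -|t|))
log_p <- log(2) + pt(-abs(t_stat), df = df, log.p = TRUE)

# Convert to log10, which is easier to read and report
log10_p <- log_p / log(10)
print(log10_p)

# Sanity check against the value stored by t.test()
print(all.equal(exp(log_p), res$p.value))
```

Working with the logarithm means the value can still be reported even when the p-value itself would underflow to zero as a double.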

For more discussion on this topic, and the code I used above, check out this Q&A on stackoverflow:
