## Jan 9 Compbio 018: Getting more significance out of R

While doing research, I have performed statistical tests that yield incredibly small p-values. This is not unusual when working with large datasets in genomics. In fact, when using R, I often get p-values so small that they fall below R's display precision for floating-point numbers, and the result is simply reported as "p-value < 2.2e-16". Here is such an example:

```r
> a <- 1:10
> b <- 100:110
> t.test(a, b)

	Welch Two Sample t-test

data:  a and b
t = -71.87, df = 18.998, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -102.39768  -96.60232
sample estimates:
mean of x mean of y 
      5.5     105.0 
```

As you can see, we only know that the p-value is smaller than 2.2e-16; comparing any two p-values below this cutoff is impossible from the printed output alone. However, R does store the computed p-value internally; only the printed display is truncated. One way to see the value below 2.2e-16 is to append `$p.value` to your statistical test in R:

```r
> t.test(a, b)$p.value
[1] 1.311457e-24
```
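The 2.2e-16 cutoff itself is not arbitrary: it is R's machine epsilon, the relative precision of a double-precision float, which is the default `eps` that `format.pval()` uses when R prints test results. You can check this directly:

```r
# Machine epsilon: the relative precision of a double, and the default
# cutoff below which R prints p-values as "< 2.2e-16".
> .Machine$double.eps
[1] 2.220446e-16
```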

Here we get the p-value as an exact number, rather than the very broad category of "< 2.2e-16". If we run a different test, between sample a and another sample, we again see a highly significant result (printed as p-value < 2.2e-16):

```r
> c <- 60:70
> t.test(a, c)$p.value
[1] 2.159892e-20
```

However, by examining the extracted p-value above, we see that it is larger than in the a vs b test, which is what we would expect from the values tested.
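For even more extreme cases, where the stored p-value itself underflows to 0, tests can still be compared on the log scale. As a sketch: for a two-sided Welch test the p-value is 2 * pt(-|t|, df), and `pt()` accepts `log.p = TRUE`, which returns the log of the tail probability without ever forming the tiny number itself. Using the t and df reported by `t.test(a, b)` above:

```r
# Recompute the two-sided p-value on the log scale, using the t statistic
# and degrees of freedom printed by t.test(a, b).
> log_p <- log(2) + pt(-71.87, df = 18.998, log.p = TRUE)
> log_p / log(10)  # log10 of the p-value; about -23.9, matching 1.3e-24
```

This recomputation is a workaround I am suggesting, not something `t.test()` exposes directly, but it lets you rank p-values that would otherwise all print as 0 or "< 2.2e-16".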

Whether you should do this is a different matter. Forums are filled with long discussions about the loss of numerical accuracy when computing tail probabilities this far below 2.2e-16. But if you "need" to present numbers below 2.2e-16 from statistical tests in R, this is a simple solution to that problem. And if the solution is problematic, I look forward to an active discussion about why, and about what else we can do.
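To be clear about where the limits actually lie: a double can store magnitudes far smaller than 2.2e-16, since that threshold is the relative precision rather than the smallest representable number. The accuracy concern is therefore about the algorithms that compute extreme tail probabilities, not about storing the result:

```r
# The smallest positive normalized double is hundreds of orders of
# magnitude below the 2.2e-16 printing cutoff.
> .Machine$double.xmin
[1] 2.225074e-308
```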

For more discussion on this topic, and the code I used above, check out this Q&A on stackoverflow: