## Oct 23 Compbio 007: A better way to show your data with R (an end to bar plots)

If you want to present some data, like expression data from RNA-seq or qRT-PCR, your first instinct might be to present it as a bar plot with error bars. That has been the standard for a long time. However, as pointed out by this wonderful PLOS Biology article, bar plots are not usually the best way to present this type of data. They can be misleading. With small sample sizes (n < 20), summary statistics (mean, standard deviation and standard error of the mean) are rather pointless. With an n of three (common in molecular biology), a much better alternative is to just show the values! A univariate scatter plot is therefore a much better choice. So here, I will show you how to make univariate scatter plots in R. Hopefully the code can be easily modified for your own use case. I used RStudio on a Windows machine and the script/code can be downloaded here. The rest of the blog post covers the code and what it is doing.

Here is what the bar plot would look like if made in Excel (the Excel workbook can be downloaded here).

It is fine. But we can do better and make our results clearer to the reader. By using ggplot2 in R, it is very simple to make beautiful univariate scatter plots.

First, we need to load the data (this command makes a data frame from a tab-delimited text file; the input file can be downloaded here). The file path is formatted for a Windows machine.
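As a minimal sketch of the loading step, assuming the input is a tab-delimited file with a header row: `read.delim()` from base R reads this format directly. The temporary file below is only a stand-in that recreates the table from this post so the example runs on its own, and the Windows path in the comment is hypothetical.

```r
# Stand-in for the real input file (an assumption): recreate the table
# from this post as a tab-delimited text file in a temporary location.
input_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Treatment\tReplicate\tRelative_expression",
  "WT\t1\t0.5", "WT\t2\t0.4", "WT\t3\t0.3",
  "mutA\t1\t1.1", "mutA\t2\t1.0", "mutA\t3\t1.7",
  "mutB\t1\t0.4", "mutB\t2\t0.3", "mutB\t3\t0.8",
  "mutC\t1\t1.3", "mutC\t2\t1.4", "mutC\t3\t1.5"
), input_file)

# read.delim() expects tab-separated text with a header row by default.
df <- read.delim(input_file, stringsAsFactors = FALSE)

# With your own file on Windows, use forward slashes (or doubled
# backslashes) in the path. This path is hypothetical:
# df <- read.delim("C:/Users/you/Documents/expression.txt")

print(df)
```

Printing `df` is a quick check that the three columns were read with the expected names and types.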

\$ df
Treatment Replicate Relative_expression
1         WT         1                 0.5
2         WT         2                 0.4
3         WT         3                 0.3
4       mutA         1                 1.1
5       mutA         2                 1.0
6       mutA         3                 1.7
7       mutB         1                 0.4
8       mutB         2                 0.3
9       mutB         3                 0.8
10      mutC         1                 1.3
11      mutC         2                 1.4
12      mutC         3                 1.5

Now that we have the table stored as a data frame in R, we can make the plot. But first we need to load the ggplot2 library, which contains all of the functions needed to actually make the plot.

\$ library(ggplot2)

We should also make a vector (an R object, like a list) which contains our treatments/conditions. They need to be in the order (left to right) that we want them to appear in the plot. R will, by default, arrange the treatments/conditions in alphabetical order. I want WT (for wild-type) to be the leftmost sample.

\$ labels <- c("WT", "mutA", "mutB", "mutC")
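An alternative way to fix the order, shown here only as a sketch, is to store the treatment column as a factor with explicit levels. The variable names `treatments` and `ordered_treatment` are illustrative and not part of the original script.

```r
# Treatment names as they appear in the data (one per condition here).
treatments <- c("WT", "mutA", "mutB", "mutC")

# Without explicit levels, factor() sorts the names; the exact order
# depends on the locale, so it is printed rather than assumed.
print(levels(factor(treatments)))

# The order we want on the x axis, left to right.
labels <- c("WT", "mutA", "mutB", "mutC")

# With explicit levels, the order is exactly the one we asked for.
ordered_treatment <- factor(treatments, levels = labels)
print(levels(ordered_treatment))
```

ggplot2 uses factor levels to order a discrete axis, so this does the same job as passing the vector to "scale_x_discrete(limits=labels)". It is a matter of preference which one you use.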

And now for the meat of it. This code, in RStudio, will make the plot appear in the plot output pane (lower right-hand corner).

\$ ggplot(df, aes(df\$Treatment, df\$Relative_expression)) + geom_point(size=3) + stat_summary(fun.y = mean, fun.ymin = mean, fun.ymax = mean, geom = "crossbar", color = "blue", size = 0.3) + scale_x_discrete(limits=labels) + expand_limits(y=0)

Fantastic, we have the plot we wanted. We can see the data itself, rather than just the summary of the data!

If you want to change the size of the points, simply change the value in "geom_point(size=3)" to something higher or lower, as required. I have added a blue bar to represent the mean. By altering the size and colour in "stat_summary(fun.y = mean, fun.ymin = mean, fun.ymax = mean, geom = "crossbar", color = "blue", size = 0.3)", you can change its appearance. Deleting this will remove the blue bar. Duplicating that code and changing the "fun.y = mean, fun.ymin = mean, fun.ymax = mean" to a different summary statistic, like median, along with changing the colour, will add a new bar representing that statistic. "scale_x_discrete(limits=labels)" ensures that the vector we made is used to order the plot, rather than alphabetical ordering! Finally, ggplot2 has a nasty habit of starting the Y axis at a value near the lowest value on the plot rather than at zero. By adding "expand_limits(y=0)", we can force the Y axis to start at zero, as it should.
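If you are running a recent ggplot2, the call above may print deprecation warnings: "fun.y", "fun.ymin" and "fun.ymax" were renamed to "fun", "fun.min" and "fun.max" in ggplot2 3.3.0, and line thickness is set with "linewidth" rather than "size" from 3.4.0. Here is a sketch of the same plot with the newer argument names, assuming ggplot2 3.4.0 or later. It also adds a red median bar to illustrate the duplication trick described above. The data frame is rebuilt inline so the example runs on its own.

```r
library(ggplot2)

# Same values as the table in this post, rebuilt inline.
df <- data.frame(
  Treatment = rep(c("WT", "mutA", "mutB", "mutC"), each = 3),
  Replicate = rep(1:3, times = 4),
  Relative_expression = c(0.5, 0.4, 0.3, 1.1, 1.0, 1.7,
                          0.4, 0.3, 0.8, 1.3, 1.4, 1.5)
)
labels <- c("WT", "mutA", "mutB", "mutC")

# Column names are given bare inside aes(); ggplot2 looks them up in df.
p <- ggplot(df, aes(Treatment, Relative_expression)) +
  geom_point(size = 3) +
  # Blue bar at the mean (newer argument names).
  stat_summary(fun = mean, fun.min = mean, fun.max = mean,
               geom = "crossbar", colour = "blue", linewidth = 0.3) +
  # Red bar at the median, added by duplicating the layer above.
  stat_summary(fun = median, fun.min = median, fun.max = median,
               geom = "crossbar", colour = "red", linewidth = 0.3) +
  scale_x_discrete(limits = labels) +
  expand_limits(y = 0)

print(p)
```

On older ggplot2 versions, keep the original argument names used in the rest of this post.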

Great, but imagine that you had more data points, or you wanted to prevent data points with similar values from overlapping. One way to do this is to add 'jitter'. To do this, you can call "geom_jitter()" rather than the "geom_point()" we were calling earlier. By altering the width parameter, we can adjust how much jitter is applied (how far the points are free to move from the centre). Here I am using very little jitter (a width of 0.2, for the few points we have), but adjust it up and down until you are happy with how your plot looks (code for the jitter plot here; thanks to @OliverBerkowitz on Twitter for suggesting this).

\$ ggplot(df, aes(df\$Treatment, df\$Relative_expression)) + geom_jitter(width = 0.2, size=3) + stat_summary(fun.y = mean, fun.ymin = mean, fun.ymax = mean, geom = "crossbar", color = "blue", size = 0.3) + scale_x_discrete(limits=labels) + expand_limits(y=0)
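One thing to be aware of: when "height" is not given, "geom_jitter()" also moves the points a little vertically, which slightly changes the plotted expression values. Here is a sketch that jitters horizontally only and fixes the random seed so the plot looks the same on every run. It assumes a ggplot2 version in which "position_jitter()" accepts a "seed" argument, and the seed value itself is arbitrary.

```r
library(ggplot2)

# Same values as the table in this post, rebuilt inline.
df <- data.frame(
  Treatment = rep(c("WT", "mutA", "mutB", "mutC"), each = 3),
  Replicate = rep(1:3, times = 4),
  Relative_expression = c(0.5, 0.4, 0.3, 1.1, 1.0, 1.7,
                          0.4, 0.3, 0.8, 1.3, 1.4, 1.5)
)
labels <- c("WT", "mutA", "mutB", "mutC")

p <- ggplot(df, aes(Treatment, Relative_expression)) +
  # height = 0 keeps every point at its true y value;
  # seed = 1 (arbitrary) makes the horizontal scatter reproducible.
  geom_point(size = 3,
             position = position_jitter(width = 0.2, height = 0, seed = 1)) +
  scale_x_discrete(limits = labels) +
  expand_limits(y = 0)

print(p)
```

The mean bar from the earlier code can be added back with the same "stat_summary()" layer. It is left out here to keep the focus on the jitter settings.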

Now we can get a much better sense of what is going on in each condition. mutA and mutC look rather similar, except for a single replicate in each. Identifying outliers with univariate scatter plots is straightforward. If there were enough samples, we might even be able to detect two sub-populations within a condition. Much of this would be masked by a bar plot. With this code, I wish you luck in starting to present your data in a simpler and easier to interpret way.