British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 006: More than one way to arrange a table - the long and wide of it

Perhaps I am the only person stupid enough to not realise this, but apparently there are more than one way to organise your data into a table. Never something I had to actively think about. Specifically when we are organising (biological) data into a table, we can either make the table a wide (or stacked) or long (or narrow or un-stacked) format. 

My first experience of working with large datasets was with the output of Cuffnorm and similar tools, which output data in the wide format. An example is below: 

Transcript    Control_expression    Treatment1_expression    Treatment2_expression    Transcript_length
T_0001    10    20    30    500
T_0002    5    5    5    1000

So here each line/row has a unique identifier (in this case, the transcript ID) and the dependent variable (expression) is in different columns, representing different experimental conditions (control, treatment 1 and treatment 2). Any other information related to the transcript is also reported in additional column (e.g. transcript length). To me, this is the most intuitive way to generate a table. But apparently this is not a universal trait. Many prefer the long format. Other tools (like Sleuth) output long format tables as default and ggplot in R assumes a long format table. A long format version of this same data would look like this: 

Transcript Condition    Expression    Transcript_length
T_0001    Control    10    500
T_0001    Treatment1    20    500
T_0001    Treatment2 30    500
T_0002    Control    5    1000
T_0002    Treatment1    5    1000
T_0002    Treatment2    5    1000

In the long formated table, each transcript is represented by multiple times. Here, what makes a row unique is the combination of the Transcript ID and the Condition. Information that is the same for each transcript is replicated (e.g. transcript length). One advantage of recording data in this way, other than getting ggplots to work, is that addition of a new field does not require re-structuring the table (you keep the same header).

But what happens if you have the data in one table type but need the other. Well R does have a library called reshape2, which does a great job of reforming tables such as these. For example, if you wanted to go from a wide to a long, you can use the function melt. Imagine you have loaded the first table into R as a dataframe called df_wide. 

$ library("reshape2")
$ df_long <- melt(df_wide, id.vars = c("Transcript", "Transcript_length"), variable.name = "Condition", value.name = "Expression")
$ df_long
  Transcript Transcript_length             Condition Expression
1     T_0001               500    Control_expression         10
2     T_0002              1000    Control_expression          5
3     T_0001               500 Treatment1_expression         20
4     T_0002              1000 Treatment1_expression          5
5     T_0001               500 Treatment2_expression         30
6     T_0002              1000 Treatment2_expression          5


So here, we have generated the long format table from the wide format with a single line of code. I am always amazed by what R can do for me with a simple function, which I would otherwise have had to spend half an hour writing buggy Python code to do. Not that I am still bitter at my ignorant the past self anymore. Not at all. 

It is also possible to go from a long format table to create a wide format table, but this is sadly not as straightforward. For this we need the dcast function. Here we take the two columns whose name will remain unchanged (Transcript + Transcript_length) and then the column which contains the names of the new columns names (~ Condition), in this case, describing the treatments. Finally, we give it the column for which the values will be take from to populate the new columns (, value.var = "Expression"), this being the all important expression value. 

$ df_wide_again <- dcast(df_long, Transcript + Transcript_length ~ Condition, value.var = "Expression") 
$ df_wide_again
  Transcript Transcript_length Control_expression Treatment1_expression Treatment2_expression
1     T_0001               500                 10                    20                    30
2     T_0002              1000                  5                     5                     5


For more information on how melt and dcast of rshape2 work, please check out:
http://seananderson.ca/2013/10/19/reshape.html

Compbio 007: A better way to show your data with R (an end to bar plots)

Compbio 005: Why should a biologist learn R?