Perhaps I am the only person stupid enough to not realise this, but apparently there are more than one way to organise your data into a table. Never something I had to actively think about. Specifically when we are organising (biological) data into a table, we can either make the table a wide (or stacked) or long (or narrow or un-stacked) format.
My first experience of working with large datasets was with the output of Cuffnorm and similar tools, which output data in the wide format. An example is below:
Transcript Control_expression Treatment1_expression Treatment2_expression Transcript_length
T_0001 10 20 30 500
T_0002 5 5 5 1000
So here each line/row has a unique identifier (in this case, the transcript ID) and the dependent variable (expression) is in different columns, representing different experimental conditions (control, treatment 1 and treatment 2). Any other information related to the transcript is also reported in additional column (e.g. transcript length). To me, this is the most intuitive way to generate a table. But apparently this is not a universal trait. Many prefer the long format. Other tools (like Sleuth) output long format tables as default and ggplot in R assumes a long format table. A long format version of this same data would look like this:
Transcript Condition Expression Transcript_length
T_0001 Control 10 500
T_0001 Treatment1 20 500
T_0001 Treatment2 30 500
T_0002 Control 5 1000
T_0002 Treatment1 5 1000
T_0002 Treatment2 5 1000
In the long formated table, each transcript is represented by multiple times. Here, what makes a row unique is the combination of the Transcript ID and the Condition. Information that is the same for each transcript is replicated (e.g. transcript length). One advantage of recording data in this way, other than getting ggplots to work, is that addition of a new field does not require re-structuring the table (you keep the same header).
But what happens if you have the data in one table type but need the other. Well R does have a library called reshape2, which does a great job of reforming tables such as these. For example, if you wanted to go from a wide to a long, you can use the function melt. Imagine you have loaded the first table into R as a dataframe called df_wide.
$ df_long <- melt(df_wide, id.vars = c("Transcript", "Transcript_length"), variable.name = "Condition", value.name = "Expression")
Transcript Transcript_length Condition Expression
1 T_0001 500 Control_expression 10
2 T_0002 1000 Control_expression 5
3 T_0001 500 Treatment1_expression 20
4 T_0002 1000 Treatment1_expression 5
5 T_0001 500 Treatment2_expression 30
6 T_0002 1000 Treatment2_expression 5
So here, we have generated the long format table from the wide format with a single line of code. I am always amazed by what R can do for me with a simple function, which I would otherwise have had to spend half an hour writing buggy Python code to do. Not that I am still bitter at my ignorant the past self anymore. Not at all.
It is also possible to go from a long format table to create a wide format table, but this is sadly not as straightforward. For this we need the dcast function. Here we take the two columns whose name will remain unchanged (Transcript + Transcript_length) and then the column which contains the names of the new columns names (~ Condition), in this case, describing the treatments. Finally, we give it the column for which the values will be take from to populate the new columns (, value.var = "Expression"), this being the all important expression value.
$ df_wide_again <- dcast(df_long, Transcript + Transcript_length ~ Condition, value.var = "Expression")
Transcript Transcript_length Control_expression Treatment1_expression Treatment2_expression
1 T_0001 500 10 20 30
2 T_0002 1000 5 5 5
For more information on how melt and dcast of rshape2 work, please check out: