British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 005: Why should a biologist learn R?

If you are already learning another programming language (like Python), then you might be asking "why should I also learn R?" It is certainly true that pretty much everything you can do in R, you can also do in Python, and there are in fact many Python modules that replicate features of R. If you are still deciding what to learn as your first language, R is a powerful way to analyse data. I find it much more direct for dealing with data, but there is less focus on some of the basic mechanics of programming (like "for loops") when learning and using R.

I resisted learning R for a long time; I was happy to use Python to do everything. But I finally changed my mind and I am happy to have R at my disposal. So here are some reasons to learn a little R, even if you are already learning another language for computational biology: 

1. R is centred around the dataframe object: dataframes are essentially tables. Much of what we do in computational biology is centred around manipulating tables of text and numbers. R is a great language for this central part of computational biology, with many functions and libraries that can manipulate dataframes in powerful ways - saving time compared to writing your own loops and functions in Python. 

2. Plotting is beautiful and (fairly) straightforward with ggplot2. I love matplotlib as a Python module for making plots, but I have often run into situations where what I wanted to do was far easier in R with ggplot2, despite having less experience with it. So I switch between the two, using the best tool for the task at hand. See this beautiful univariate scatter plot made in ggplot2 compared to a bar plot made in Excel (post on how to make univariate plots in R coming soon).


3. A lot of bioinformatics software is distributed as an R package. This means that if you want to use a certain program, you cannot just run a binary on the command line, as you can with Cufflinks or Salmon; you have to load up R and either enter the code line by line or run an R script. Learning to use the R package Sleuth to analyse RNA-seq data forced me to learn more R. I could only get so far by copy-pasting the commands in the tutorial before I had to customize the commands I was entering, and knowing a little R went a long way.

It is true that you can use the Pandas module in Python to replicate dataframes. That is great: many people never learn R and are happy to do everything in Python, either without dataframes or by using Pandas. I am not a fan of Pandas, but I think that has more to do with how my head works. I just find the syntax for handling dataframes difficult in Pandas, despite normally liking Python's syntax more than R's.
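For anyone who has not yet seen a dataframe in Python, here is a minimal sketch of what working with one in Pandas can look like. The gene names, column names and values are made up purely for illustration, and this is only one of several ways to write it.

```python
import pandas as pd

# Hypothetical toy table: one row per gene and condition (values are made up)
expression = pd.DataFrame({
    "gene": ["ACT1", "ACT1", "RPL3", "RPL3"],
    "condition": ["control", "treated", "control", "treated"],
    "tpm": [120.0, 95.0, 300.0, 450.0],
})

# Keep only the rows above an abundance threshold, without writing a loop
abundant = expression[expression["tpm"] > 100]

# Average the abundance of each gene across conditions
mean_tpm = expression.groupby("gene")["tpm"].mean()

print(abundant)
print(mean_tpm)
```

Whether this syntax feels natural is largely a matter of taste, which is really the point being made above.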

If you are in the middle of learning your first (non-R) language, then I'm not sure jumping into R right away is the best approach. Get good with what you are doing. But when the day comes that you want to do some stats, modify a table in a strange way or use a library for analysis which is distributed as an R package, be willing to learn R. It will not be easy, but it can pay off: I have spent hours writing Python code to re-arrange a table, code which took ages to run, only to discover much later that R has a function that re-arranges the table exactly as I wanted in a single line of code and runs almost instantly. I felt rather stupid for not realizing sooner.
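To give a rough feel for the difference between re-arranging a table by hand and using a built-in function, here is a small sketch in Python. It assumes the task is turning a "wide" table (one column per sample) into a "long" one (one row per gene and sample), which may not be the exact re-arrangement described above. The table and its column names are hypothetical, and the Pandas `melt` function is only a stand-in for the single-line R function, which is not named here.

```python
import pandas as pd

# Hypothetical wide table: one row per gene, one column per sample
wide = pd.DataFrame({
    "gene": ["ACT1", "RPL3"],
    "sample_A": [120.0, 300.0],
    "sample_B": [95.0, 450.0],
})

# Hand-written approach: visit every row and every sample column in turn
rows = []
for _, row in wide.iterrows():
    for sample in ["sample_A", "sample_B"]:
        rows.append({"gene": row["gene"], "sample": sample, "tpm": row[sample]})
long_by_loop = pd.DataFrame(rows)

# Built-in approach: one call does the same re-arrangement
long_by_melt = wide.melt(id_vars="gene", var_name="sample", value_name="tpm")

print(long_by_melt)
```

Both versions hold the same four gene/sample/value rows, just in a different order. On a real table with thousands of genes and many samples, the nested loop is the part that gets slow and fiddly.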

There are many ways to learn R. Here is one resource to check out:

As with most things, the easiest way to learn is by doing, ideally with your own data to solve a problem you actually have. Next time you run into a problem you are having a really tricky time with, do some Googling to see if it has an easy solution in R.
