British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 004: Practical Python for biologists - Use dictionaries

When first using Python, you quickly become familiar with for loops and lists, and it is tempting to try to do everything with them. But dictionaries in Python are incredibly powerful and useful. I once wanted to do some rather complex filtering and matching on a file's contents, so I wrote a for loop within a for loop within a for loop within a for loop. That's a four-level nested for loop. Then I wondered why it was going so slowly on the very large file I was using. A friend suggested that it might take the life-span of the universe to complete. The logic was sound, but it would be slow. VERY SLOW.

But by storing the information I wanted in a dictionary, a special object in Python, I could search through this dictionary at light-speed. I will admit I do not fully understand the computer science reasons behind this (the short version is that a dictionary is a hash table, so finding a key takes roughly the same time however many entries it holds, whereas searching a list means checking the items one by one), but it works. While I had been aware of dictionaries, I was scared of them. They seemed overly complicated, with no advantage over lists and for loops. But as a way of storing data and looking up a value quickly, they are amazing.
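To make the speed difference concrete, here is a minimal sketch of the same matching task done two ways. The gene IDs and expression values are made up for illustration, and the code uses Python 3 syntax, where print is a function. The nested-loop version compares every wanted ID against every record; the dictionary version builds a lookup once and then finds each ID directly.

```python
# Hypothetical data: (gene_id, expression) records and a list of IDs to find.
records = [("AT1G01010", 5.2), ("AT1G01020", 0.7), ("AT1G01030", 12.9)]
wanted = ["AT1G01030", "AT1G01010", "AT9G99999"]

# Nested loops: every wanted ID is compared against every record.
matches_loop = []
for gene_id in wanted:
    for record_id, expression in records:
        if record_id == gene_id:
            matches_loop.append((gene_id, expression))

# Dictionary: build the lookup once, then each search is a single step.
expression_by_gene = dict(records)
matches_dict = []
for gene_id in wanted:
    if gene_id in expression_by_gene:
        matches_dict.append((gene_id, expression_by_gene[gene_id]))

# Both approaches find the same matches.
print(matches_loop == matches_dict)
```

With three records the difference is invisible, but the loop version makes len(wanted) x len(records) comparisons, while the dictionary version makes roughly one lookup per wanted ID. On a very large file that gap is the difference between seconds and the life-span of the universe.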

The basic logic of a dictionary in Python is that each value is linked to a key. The key-value pair is at the core of dictionaries. At first I tried to imagine dictionaries as two-dimensional, likening them to a table with a header row. That was a bad way to think about them. You could store a table in a dictionary, in many different ways, but you can do so much more. Dictionaries can be multi-leveled. Let's start off by making a dictionary in Python:

$ my_dict = {} 

This makes an empty dictionary; my_dict is the variable name we are using here to store a dictionary, but you can call it just about anything: 

$ dict_for_data = {}
$ this_is_not_a_dictionary = {}  # It really is a dictionary

The empty dictionary is generated with the curly brackets {} just like an empty list is made with square brackets []. To populate the dictionary, you write the name of your dictionary (here my_dict), followed by a key name in square brackets (I hate that it is square brackets to do this, it always confused me), followed by an equals sign to assign the value. The key here is acting like a variable in Python, but it is a variable within a dictionary and it can be a string or a number:

$ my_dict["gene"] = "RS2Z37"
$ my_dict[42] = "RS2Z38"
$ print my_dict
{42: 'RS2Z38', 'gene': 'RS2Z37'} 

There, a dictionary with two key-value pairs. Not a very useful dictionary, but a functional one all the same. Now we can look up a value associated with a key of interest:

$ print my_dict["gene"]
RS2Z37
$ print my_dict[42]
RS2Z38 
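
One thing to watch for: asking for a key that is not in the dictionary raises a KeyError. Here is a small sketch of two common ways around this, using the same my_dict. It uses Python 3 syntax, and the key "protein" is just my example of a key that is absent.

```python
my_dict = {"gene": "RS2Z37", 42: "RS2Z38"}

# Check that the key exists before looking it up.
if "protein" in my_dict:
    print(my_dict["protein"])
else:
    print("no entry for 'protein'")

# Or use .get(), which returns a default value instead of raising KeyError.
missing = my_dict.get("protein", "not found")
present = my_dict.get("gene", "not found")
print(missing, present)
```

Which one to use is mostly taste: the in check reads well when a missing key needs special handling, while .get() is handy when a fallback value is good enough.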

We could make dictionaries with something more useful in them. Maybe we can store gene or protein sequences in a dictionary and call up the sequence by typing in the gene name? Let's try it with primer sequences to see how to do it. This is how we make a dictionary with pre-populated key-value pairs (primer sequences are for qPCR and can be found here):

$ primers = {"EF1a_F": "CAGGGTGTCCAGAACGGTGT", "EF1a_R": "CCTCGCTCTAGCTTCCAGCA", "RS2Z37_F": "ATATGGGAGGTTGGTGGGC", "RS2Z37_R": "CAAGGAAGCACGCCTACGAT", "eIF5L1_F": "AGGAATCCTGCGTACACCAC", "eIF5L1_R": "GGGACGCAAGTTTTGAGGTA"}
$ print primers
{'EF1a_F': 'CAGGGTGTCCAGAACGGTGT', 'eIF5L1_F': 'AGGAATCCTGCGTACACCAC', 'RS2Z37_F': 'ATATGGGAGGTTGGTGGGC', 'RS2Z37_R': 'CAAGGAAGCACGCCTACGAT', 'EF1a_R': 'CCTCGCTCTAGCTTCCAGCA', 'eIF5L1_R': 'GGGACGCAAGTTTTGAGGTA'}

Now if we wanted to look up a primer sequence at will, and we knew the key name, it is as simple as before: 

$ print primers["EF1a_F"]
CAGGGTGTCCAGAACGGTGT
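
In practice you rarely type a dictionary out by hand. As a rough sketch, assuming the primers arrive as tab-separated lines of name and sequence (the table_text string below stands in for a real file, and this layout is my assumption), the same kind of dictionary can be filled in a loop. Python 3 syntax:

```python
# Stand-in for the contents of a tab-separated file: name<TAB>sequence.
table_text = "EF1a_F\tCAGGGTGTCCAGAACGGTGT\nEF1a_R\tCCTCGCTCTAGCTTCCAGCA\n"

primers_from_table = {}
for line in table_text.splitlines():
    if not line.strip():
        continue  # skip blank lines
    name, sequence = line.split("\t")
    # The primer name becomes the key, the sequence becomes the value.
    primers_from_table[name] = sequence

print(primers_from_table["EF1a_F"])
```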

Simple, but not the most useful dictionary. To make it more useful, we need to create dictionaries within dictionaries. This is why thinking of a dictionary as a simple flat object is bad. Making dictionaries within dictionaries can become very complicated very quickly. The important thing to remember is the key-value pairing. But the value in a pair can itself be another dictionary with its own keys, giving key-key-value. Maybe we wanted to group all of the primers from the same gene into one dictionary:

$ primers_2 = {"EF1a": {"F": "CAGGGTGTCCAGAACGGTGT", "R": "CCTCGCTCTAGCTTCCAGCA"}, "RS2Z37": {"F": "ATATGGGAGGTTGGTGGGC", "R": "CAAGGAAGCACGCCTACGAT"}, "eIF5L1": {"F": "AGGAATCCTGCGTACACCAC", "R": "GGGACGCAAGTTTTGAGGTA"}}
$ print primers_2
{'eIF5L1': {'R': 'GGGACGCAAGTTTTGAGGTA', 'F': 'AGGAATCCTGCGTACACCAC'}, 'EF1a': {'R': 'CCTCGCTCTAGCTTCCAGCA', 'F': 'CAGGGTGTCCAGAACGGTGT'}, 'RS2Z37': {'R': 'CAAGGAAGCACGCCTACGAT', 'F': 'ATATGGGAGGTTGGTGGGC'}}

Just make sure that there are no syntax errors, like a missing quotation mark or colon, and that there is the correct number of curly brackets. Now we can look up both primers for a gene at once:

$ print primers_2["RS2Z37"]
{'R': 'CAAGGAAGCACGCCTACGAT', 'F': 'ATATGGGAGGTTGGTGGGC'}

But what if we just wanted the sequence of the forward primer? Well, that is nice and simple: we call the dictionary with the gene name as we have just done, and in a second pair of square brackets straight after the first pair, we enter whether we want the forward or reverse primer using an F or an R, like so:

$ print primers_2["RS2Z37"]["F"]
ATATGGGAGGTTGGTGGGC

And as if by magic, we have exactly what we wanted. When dealing with many more entries or more dictionary levels, the task becomes much more complicated, but the result can also be very helpful (and fast).
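
For larger sets, the nested dictionary does not need to be typed by hand either. Here is a minimal sketch, assuming every key in the flat primers dictionary follows the gene_F / gene_R naming pattern used above. It uses Python 3 syntax, and the variable name nested is mine.

```python
primers = {
    "EF1a_F": "CAGGGTGTCCAGAACGGTGT", "EF1a_R": "CCTCGCTCTAGCTTCCAGCA",
    "RS2Z37_F": "ATATGGGAGGTTGGTGGGC", "RS2Z37_R": "CAAGGAAGCACGCCTACGAT",
    "eIF5L1_F": "AGGAATCCTGCGTACACCAC", "eIF5L1_R": "GGGACGCAAGTTTTGAGGTA",
}

nested = {}
for name, sequence in primers.items():
    # Split on the last underscore: "RS2Z37_F" -> ("RS2Z37", "F").
    gene, direction = name.rsplit("_", 1)
    # Create the inner dictionary the first time a gene is seen.
    nested.setdefault(gene, {})[direction] = sequence

# Walk both levels: outer keys are genes, inner keys are F or R.
for gene, pair in nested.items():
    for direction, sequence in pair.items():
        print(gene, direction, sequence)
```

Splitting on the last underscore rather than the first is a deliberate choice here, so that a gene name that itself contained an underscore would still be kept in one piece.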

If you have not already, you can look at more dictionaries in Python for Biologists:
https://pythonforbiologists.com/dictionaries/

As always, the best way to get used to dictionaries is by using them in your work to solve problems. Don't be deterred by errors; the internet is there to help you troubleshoot!
