British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 014: Practical Python for biologists - Finding motifs with Python

Have nucleotide or protein sequences and want to see what motifs are contained within them? Luckily for you, this is fairly straightforward with the regular expressions (re) module of Python (What are modules?). 

The regular expressions module has some powerful tools for searching text, so let's start by importing it. 

$ import re

We will start with something simple, we will store a protein sequence as a variable. The transcriptional co-repressor TOPLESS (TPL) binds to short motifs in proteins. LxLxL (where x is any amino acid) is a well characterized TPL binding motif. But other motifs exist, such as LxLxPP, (R/K)LFGV and TLxLF. 

The moss (Physcomitrella patens) Aux/IAA protein, which represses auxin-mediated gene expression, IAA1b (Pp3c8_14720V3) contains the LxLxPP repression motif - but don't take my word for it, let's check for ourselves. We will use the search() function from the regular expression module. It will return True if the string contains the motif. 

$ if"L*L*PP", peptide):
$     print("Repression domain found")
$ else: 
$     print("Nope, nothing")
Repression domain found

Yay, we were right and the peptide contains the motif. We have actually done some wet-lab work on this and found that mutation of the second L in LxLxPP prevented the interaction between PpTPL1 and PpIAAb in the yeast-two hybrid assay (TOPLESS co-repressor interactions and their evolutionary conservation in plants). 

Now for a more complex example: we want to count the number of a motif across multiple sequences. I recently co-authored a paper looking at the evolution of an RNA decay pathway (Conservation of nonsense-mediated mRNA decay complex components throughout eukaryotic evolution). The essential details are: UPF1 is phosphorylated by SMG1 at SQ or TQ dipeptides. However, some species have lost SMG1. These include baker's yeast, the model flowering plant (Arabidopsis thaliana) and brown algae (SMG1 is an ancient nonsense-mediated mRNA decay effector). Moss, humans and some fungi retain SMG1. So, when SMG1 is lost from a genome, is UPF1 depleted of S/TQ dipeptides? Does the age of SMG1 loss affect the depletion of UPF1 os S/TQ dipeptides? Afterall, A. thaliana lost SMG1 recently, baker's yeast lost it millions of years ago. 

Now we can read in the fasta file (for description of a fasta file, click here) with the sequences of interest, to download fasta file here. Processing a fasta file will require us to open and close an input and output file (for opening and saving files with Python). 

$ infile = "/home/jamesl/Bad_grammar_good_syntax/input014_UPF1.fa" #Setting the path to the input file
$ infile = open(infile, 'rU') #Open the file for reading
$ lines = infile.readlines() #Makes list of all the lines

Then set the output file:

$ outfile = "/home/jamesl/Bad_grammar_good_syntax/014_UPF1_outfile.txt" #Setting the path to the output file
$ outfile = open(outfile, 'w') #Open/create output file

Now we can make a dictionary (introduction to dictionaries here) to store our fasta sequences: name of sequence is the key, sequence itself as value. The dictionary will be called seq_dict:

$ seq_dict = {}
$ for line in lines:
$     line = line.strip() #Remove newline symbol (\n)
$     if line.startswith(">"):
$         name = line
$         seq_dict[name] = []
$     elif line.startswith(">") == False:
$         seq_dict[name].append(line)

This code identifies if the line is a ID line in a fasta file (starts with a >) or not (sequence line) and will combine all the lines representing the sequence into one. To make a header for our output file, we can do this:

$ header = ["UPF1 sequence", "SQ number", "TQ number", "S/TQ total", "Percentage S/TQ", "S/TQ Richness"]
$ outfile.write('\t'.join(header) + '\n') #Write the header to the outfile

Now for the meat of the code: finding the S/TQ dipeptides in the sequence. We do this with the findall() function from the regular expressions module of Python. The exact function we are using will be this: re.findall(r"[ST]Q", protein). Let's dissect this. First, re. is calling the regular expressions module, where the function is stored. Second, findall() is the function within re. Third, r"[ST]Q" is the string we are looking for. The r before "[ST]Q" indicates to Python to ignore any special characters it might encounter. The [ST] means that either an S or an T can be located in this position and the Q is a must. Finally, protein is the variable name we will set to holding the protein sequence we are checking. This function will return a list of all of the SQ and TQ dipeptides we identify. From there, it is simple to iterate over the list to count how many SQ and TQ dipeptides are present. The code below does all of this from the dictionary we made: 

$ for seq in seq_dict:
$     ''.join(seq_dict[seq])
$     protein_length = float(len(seq_dict[seq][0]))
$     protein = seq_dict[seq][0]
$     ST_Q_list = re.findall(r"[ST]Q", protein) #Makes list of all S/TQ
$     SQ_count = 0
$     TQ_count = 0
$     for dipeptide in ST_Q_list: #Loop through all dipeptides
$         if dipeptide == 'SQ': #Counts SQ dipeptides
$             SQ_count += 1
$         elif dipeptide == 'TQ': #Counts SQ dipeptides
$             TQ_count += 1
$         else: 
$             print('Houston, we have a problem')
$     new_line = []
$     new_line.append(seq)
$     new_line.append(str(SQ_count))
$     new_line.append(str(TQ_count))
$     new_line.append(str(SQ_count+TQ_count))
$     percentage = (((SQ_count+TQ_count)/protein_length)*100)*2
$     new_line.append(str(percentage))
$     if percentage >= 2:
$         new_line.append('Rich')
$     elif percentage < 2:
$         new_line.append('Poor')
$     outfile.write('\t'.join(new_line) + '\n') #Write the line to the outfile
$ infile.close() #Closes the file (must do)
$ outfile.close() #Closes the file (must do) 

Now we can check the outfile to examine S/TQ richness of UPF1 proteins from each species:

$ head 014_UPF1_outfile.txt
UPF1 sequence   SQ number       TQ number       S/TQ total      Percentage S/TQ S/TQ Richness
>Arabidopsis_thaliana_AT5G47010.1       20      8       28      4.46570972887   Rich
>Bakers_yeast_NP_013797.1       4       2       6       1.23583934089   Poor
>Brown_algae_CBJ30064.1 4       2       6       1.08991825613   Poor
>Moss_UPF1a_Pp1s44_135V6.1      15      15      30      4.55927051672   Rich
>Rhizophagus_irregularis_DAOM_197198w_EXX60126.1        17      9       26      4.89181561618   Rich

The output here might not be very readable, but what we see is that with the exception of A. thaliana, all species without SMG1 (Baker's yeast and brown Algae) have a low/poor S/TQ dipeptide level (<3% of total protein). Species with SMG1 (moss and the fungi, Rhizophagus irregularis) have an S/TQ rich (>=3% of total protein) UPF1. A. thaliana is rather unusual in that it lacks SMG1 but retains a high S/TQ level in UPF1. A. thaliana only recently lost SMG1 (within the last 5-10 million years since it split from A. lyrata), so perhaps not been enough time has elapsed for loss of S/TQ dipeptides through genetic drift. Alternatively, perhaps A. thaliana has co-opted another kinase to phosphorylate the same residues. More research is required (granting bodies, feel free to give us money). 

You can download the Jupyter notebook containing all of the code for this analysis here. I hope this post has not been too much of a shameless plug for my research; instead I hope the real world examples will help demonstrate how useful Python programming can be.

To learn more about regular expressions, I REALLY recommend Python for Biologists: 

Compbio 015: Practical Python for biologists - Processing two input files

Compbio 013: Practical Python for Biologists - Opening and saving files