British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

Compbio 008: What the FASTQ? File formats in computational biology

There are a number of things to get used to if you are a biologist moving from the bench to the terminal. One of them is the various file formats that don't always seem to make sense. Most file formats in bioinformatics/computational biology are just plain text files - could end in .txt and be opened in Notepad or TextEdit. But these formats have their own extensions (like .gtf) and rules that enforce how the data is formatted. However, these rules are not enforced by the computer but by the human producing the file. This post will go over a some of the most common and important file formats you might come across: FASTA, FASTQ and GTF.  

The first thing to remember when looking at compbio file formats is that bioinformaticians are probably descended from pirates, given that the rules governing how these files are setup are more...guidelines. 

FASTA

FASTA is one of the most important file formats. Useful for those constructing phylogenetic trees or storing a whole genomes. This is a strange format. Instead of being a nice table with rows and columns, it has two lines per entry. A FASTA file can have a single entry or multiple entries. The first line of an entry is the name/header and always starts with a >. The rest of the line contains the name of the sequence. This name/header line can contain space. It might be 1 or chr1 if this is the chromosome of an organism, or can be extremely long, if a sequence from NCBI of a single gene. The part of the entry contains the sequence data. FASTA is useful for storing either nucleic acid or protein sequence data. The sequence can be stored on a single line or split across multiple lines, making FASTA a challenging format to work with if it is the first file format that you are trying to parse as a budding programmer! Here is an example of the first three lines from the moss genome's FASTA file. 

$ head -3 moss_genome_v3.fa
>Chr01
CTAAACCCTAAACCCTAAAGCCTAAACCCTAAACCCTAAACCCTAAAACTAAACCCTAGACCCTAAACCCTAAACCCTAA
ACCCTAAACCCTAAACCCTCAAGCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAAGCCTAAACC

There is no standard file extensions for a FASTA file as it is basically a .txt file, but .fasta and .fa are commonly used. 


FASTQ

The FASTQ file format is important because it is how reads from a sequencing run are stored - whether your experiment is a ChIP-seq or RNA-seq etc. The basic format of the FASTQ file are blocks of four rows. Yes, that's right, blocks of four rows. Why not one row with four (or more) columns, separated with a tab or comma? Hard to say. Such a file format apparently did exist called QSEQ. 

But apparently people wanted a format similar to the FASTA file format that they were used to so FASTQ won out. Clearly FASTQ is the VHS to QSEQ's Betamax. The first line of a block in a FASTQ file, representing a read is: 
First: Name of the read, starting with the @ symbol, similar to the header of a FASTA file
Second: The read's sequence
Third: Has a + symbol at the start and can be followed by the name and descriptor
Fourth: Quality/confidence of each position in the read from the basecaller (must be of equal length of the second line of the block) 

Here is an example of an RNA-seq read in FASTQ file format, showing the top four lines of a much longer FASTQ file. 

$ head -4 all_your_reads.fastq
@SRR99999999.1 1/1
GTTGGAACTAACAAAATGTACTTGAAATAGTAATATTGTTTACATTAAAC
+
BCCA@GDGGGGGGGGGGGGGGGGGGGGGGGGGDGDGEFDEGGGGGGGGGG

There is no standard file extensions for a FASTQ file as it is basically a .txt file, but .fastq and .fq are commonly used. 

GTF

You might have a whole genome sequence in a large FASTA file, but without an annotation file, how do you find the genes of interest? Where are all of the exons of the transcribed units? Where are the transposable elements that make up most of the genome of many organisms? GTF and GFF are common file formats for storing genome annotations. They are commonly required by software used to quantify expression from RNA-seq data. In eukaryotic genomes, they are useful for marking where a transcribed unit is spliced, so that read aligners know where to break a read into two or more to be aligned to the genome in a sensible, biologically relevant way. Finally, with a GTF, we have a file format that is separated into columns rather than blocks of lines, making it easier to parse by a digital dummy like myself...in theory. In practice, it isn't that simple thanks to column 9 (but more on that later). The basic format of a GTF file is that it has 9 tab-delimited columns. Nice and simple, well except for the 9th column being, well, strange. But let's start with the first eight: 

First: Chromosome or scaffold number
Second: The program or organization that generated this feature (such as Cufflinks or phytozome) 
Third: Feature name, such as exon or transcript or CDS or start_codon
Fourth: Starting position of the feature. The first base is numbered 1 (rather than 0 as in some file formats) 
Fifth: Ending position of the feature (inclusive) 
Sixth: A score between 0 and 1000. Usually there is no score and the value of ":." is entered
Seventh: Strand of the feature ("+", "-", or "." if unknown or not required) 
Eighth: Reading frame of the feature if known or needed. Marked with a "." if not known or needed 

All of these are pretty straight forward. Often, GTF files will simply contain one type of feature (designated in the third column), such as the output from tools like Cufflinks, with only exons populating the file. When you think about it, this is all that is really required. If a transcript is a single exon, it has one line in the GTF file, with a start and stop location of that exon. For a transcript with multiple exons from splicing, then it will require multiple lines to represent this transcript, with each line representing a different, non-overlapping exon. But there is no information in the first eight columns describing the name of the transcript or the name of the gene. Where is this stored: in the infamous 9th column. The 9th column is so strange because it is kinda like a second table within the table. The first two entries of this table must be 1) the gene ID and 2) the transcript ID. But then other attributes can be added, like exon number. 

To keep this 9th column organized and readable, given how much they might vary (from the essential two attributes to several attributes, many of which only the file's creator will find interesting), the 9th column is organized as key-value pairs. The key would be text such as gene_id, followed by a space and doublequotes around the value. Then to separate this from the next key-pair value, this is followed by a semi-colon and a space: 

$ gene_id "gene_of_interest"; transcript_id "transcript_of_interest";

Already you can start to imagine how much fun it is to parse this. Needing to separate out the key-value features of the attribute. Especially if your genome has oddly named genes with a semicolon in the name! (Yes this did happen to me! Thanks Arabidopsis thaliana). 

If working with Python, please feel free to use the below function, written by my friend and Python mentor Courtney French, to extract a value from a particular key (this should account for semicolons in the name of the feature you are interested in): 

def pullValFromGTFinfo(info, tag):
   info = info.split()
   i = 0
   while i < len(info):
       if info[i] == tag:
           return info[i+1].replace("\";","").replace("\"","")
       i += 1
   return

You can download this function here with an example of how to call it. 

But be careful with GTF files, different people really have different ideas on how to implement it. Take GTFs from UCSC, a common place to download genome annotations from. They use the barebones version with only gene_id and transcript_id in the 9th column but with one major issue: the gene_id and transcript_id for an entry are identical. 

$ head -3 UCSC_genes_hg19.fq
chr1    hg19_knownGene  exon    11874   12227   0.000000        +       .       gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1    hg19_knownGene  exon    12613   12721   0.000000        +       .       gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1    hg19_knownGene  exon    13221   14409   0.000000        +       .       gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";

To focus on only the 9th column, I can use the cut command in bash:

$ cut -f 9 UCSC_genes_hg19.fq | head -3
gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";

The transcript_id is used to populate the gene_id. This causes problems for some tools. Take SUPPA for example, where it needs transcripts to be grouped into genes to quantify alternative splicing. This way to implement means that you, the user of a UCSC GTF has to figure out how to group transcript IDs. 

Hopefully this has given you a taste of what file formats exist. To learn more about these and other file formats, check out this excellent FAQ on the UCSC genome browser's website, which details the common file formats in bioinformatics. 

Compbio 009: Practical Python for biologists - What are modules?

Compbio 007: A better way to show your data with R (an end to bar plots)