High-throughput technologies have revolutionized the life sciences, producing vast quantities of data at unprecedented speed. To make sense of these data sets, you need to be familiar with the file formats in which they are stored. Below, we discuss some of the most common file formats in high-throughput research: FASTQ, FASTA, BAM, BAI, SAM, VCF, GFF/GTF, BED, BedGraph, BigWig, and PDB.
File Format | Link/Anchor |
---|---|
FASTQ | FASTQ File Format |
FASTA | FASTA File Format |
BAM/SAM | BAM File Format |
BAI | BAI File Format |
SAM | SAM File Format |
VCF | VCF File Format |
GFF/GTF | GFF/GTF File Format |
BED | BED File Format |
BedGraph | BedGraph File Format |
BigWig | BigWig File Format |
PDB | PDB File Format |
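FASTQ files are widely used in bioinformatics for storing raw sequence data and the corresponding quality scores. Each entry in a FASTQ file consists of four lines: a sequence identifier beginning with "@", the raw sequence, a separator line beginning with "+", and a quality string with one character per base: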
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
FASTQ files are widely used in Next-Generation Sequencing (NGS) technologies such as Illumina, SOLiD, and Ion Torrent.
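To make the four-line layout concrete, here is a minimal Python sketch that reads FASTQ records and converts the quality characters to numeric scores. It assumes the Phred+33 encoding (the function name `parse_fastq` is ours, not part of any library), and it loads all lines into memory, which is fine for illustration but not for real sequencing runs.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, phred_scores) for each four-line FASTQ record."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumption: Phred+33 encoding, so '!' (ASCII 33) means quality 0.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# The record shown above, one list item per line.
example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
read_id, sequence, scores = next(parse_fastq(example))
mean_quality = sum(scores) / len(scores)
print(read_id, len(sequence), round(mean_quality, 1))
```

For real data you would normally reach for an established parser, but the sketch shows that a FASTQ record is nothing more than four lines with a simple character-to-score mapping.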
FASTA format is a simple and widely used format for representing nucleotide sequences (DNA, RNA) or protein sequences. A FASTA file starts with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol at the beginning.
>SEQ_ID Description of the sequence
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
It's worth noting that the FASTA format does not contain quality scores, which is a major difference from the FASTQ format. FASTA files are frequently used in genome assemblies and gene prediction methods, as well as in sequence alignment and homology searches.
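Because a FASTA sequence may be wrapped across several lines, a reader has to join everything between one ">" line and the next. The following Python sketch shows one way to do that; `parse_fasta` is a hypothetical helper name, and taking the identifier as the text up to the first space is a common convention rather than a rule of the format.

```python
def parse_fasta(lines):
    """Return {sequence_id: sequence} from an iterable of FASTA lines."""
    chunks = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Convention assumed here: the ID is the first word of the description.
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks[current].append(line)
    # Join the wrapped lines of each record into one string.
    return {name: "".join(parts) for name, parts in chunks.items()}


example = [
    ">SEQ_ID Description of the sequence",
    "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC",
    "TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA",
]
records = parse_fasta(example)
print({name: len(seq) for name, seq in records.items()})
```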
The Binary Alignment/Map (BAM) format is a binary, compressed representation of sequence alignment data. BAM files can store aligned sequences from high-throughput sequencing technologies, making them useful for representing sequence reads aligned to reference genomes. They can handle large amounts of data efficiently, which is essential in the high-throughput era.
BAM records decoded to text by a tool such as Samtools (samtools view):
seq1 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * NM:i:1 MD:Z:8G4^C3
seq2 163 ref 9 30 3S6M1P1I4M = 39 39 AAAAGATAAGGATA * NM:i:0 MD:Z:10
A BAM file includes a header section and an alignment section. The header contains information about the reference sequences and the alignment process, while the alignment section contains the alignment information for individual sequence reads.
A BAM Index file (BAI) accompanies a BAM file. It's a binary file that provides quick access to the alignment data for a region of the genome in the corresponding BAM file. This feature is useful when working with large data sets, as it allows researchers to access specific genomic regions without having to scan the entire file. Using a BAI file, researchers can quickly retrieve all reads aligned to a particular region, making it invaluable for tasks such as visualizing data in a genome browser or extracting data from targeted genomic regions.
Chromosome 1: -----------------------------
BAI pointers: ^ ^ ^
Chromosome 2: -------------------------------
BAI pointers: ^ ^ ^ ^
BAM file: |--------|--------|--------|--------|--------|
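In this diagram, each caret (^) stands for an index entry that points to the place in the BAM file where the reads for that part of the chromosome are stored.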
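The idea behind an index can be illustrated with a toy example in Python. This is only a sketch of the principle, not the real BAI layout: an actual BAI file combines a binning scheme with a linear index over 16 kbp windows and stores offsets into the compressed BAM file. Here the "file" is a sorted list of reads, and the window size and function names (`build_index`, `fetch`) are invented for illustration.

```python
WINDOW = 100  # toy window size; real BAI linear indexes use 16,384 bp windows


def build_index(reads):
    """Map each (chromosome, window) to the position of the first read touching it.

    `reads` is a list of (chrom, start, end) tuples sorted by chromosome and
    start, with 0-based half-open coordinates.
    """
    index = {}
    for i, (chrom, start, end) in enumerate(reads):
        for window in range(start // WINDOW, (end - 1) // WINDOW + 1):
            index.setdefault(chrom, {}).setdefault(window, i)
    return index


def fetch(reads, index, chrom, start, end):
    """Return reads overlapping [start, end) without scanning from the beginning."""
    windows = index.get(chrom, {})
    candidates = [
        windows[w]
        for w in range(start // WINDOW, (end - 1) // WINDOW + 1)
        if w in windows
    ]
    if not candidates:
        return []
    hits = []
    # Jump straight to the earliest candidate, then stop once reads start past the region.
    for i in range(min(candidates), len(reads)):
        read_chrom, read_start, read_end = reads[i]
        if read_chrom != chrom or read_start >= end:
            break
        if read_end > start:
            hits.append(reads[i])
    return hits


reads = [
    ("chr1", 10, 60),
    ("chr1", 150, 200),
    ("chr1", 190, 260),
    ("chr1", 500, 550),
    ("chr2", 20, 70),
]
index = build_index(reads)
print(fetch(reads, index, "chr1", 180, 195))
```

The lookup touches only the reads near the requested region, which is the property that makes genome browsers responsive on very large BAM files.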
SAM (Sequence Alignment/Map) is a tab-delimited text format designed for storing biological sequences aligned to a reference sequence. It's essentially the human-readable version of the binary BAM format. It consists of a header and an alignment section.
The header section starts with '@' and includes information such as the reference sequence names and lengths, the programs used for alignment, and the sequencing platform. The alignment section, on the other hand, contains information about each read and its alignment to the reference.
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * NM:i:1 MD:Z:8G4^C3
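Here is what the columns represent:
Column | Field | Value in the example | Meaning |
---|---|---|---|
1 | QNAME | r001 | Name of the read (query) |
2 | FLAG | 99 | Bitwise flag; 99 = 1 + 2 + 32 + 64, meaning paired, properly paired, mate on the reverse strand, first read of the pair |
3 | RNAME | ref | Name of the reference sequence |
4 | POS | 7 | 1-based leftmost mapping position |
5 | MAPQ | 30 | Mapping quality (Phred-scaled) |
6 | CIGAR | 8M2I4M1D3M | Compact description of the alignment |
7 | RNEXT | = | Reference of the mate; "=" means the same as RNAME |
8 | PNEXT | 37 | Position of the mate |
9 | TLEN | 39 | Observed template length |
10 | SEQ | TTAGATAAAGGATACTG | Read sequence |
11 | QUAL | * | Base qualities; "*" means they are not stored |
12+ | Optional tags | NM:i:1 MD:Z:8G4^C3 | TAG:TYPE:VALUE fields, such as NM (edit distance to the reference) and MD (mismatched and deleted reference bases) |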
The SAM format can get quite complex due to the number of optional fields and the bitwise FLAG field, which can represent several attributes of the read in binary. The CIGAR string is a compact representation of the alignment of the read to the reference genome, where 'M' denotes match or mismatch, 'I' denotes insertion, and 'D' denotes deletion.
This entry represents just a single sequence read. A SAM file can contain millions or even billions of such entries, often accompanied by a header section with information about the sequencing run and alignment.
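The FLAG and CIGAR fields are easier to understand once decoded. The Python sketch below uses the bit values and operation letters defined by the SAM specification; the function names and the short labels given to each bit are our own choices for illustration.

```python
import re

# Bit values from the SAM specification; the label strings are ours.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the properties whose bits are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def query_length(cigar):
    """Number of read bases described: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def reference_span(cigar):
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("8M2I4M1D3M"))
print(query_length("8M2I4M1D3M"), reference_span("8M2I4M1D3M"))
```

For the example record, the query length of 17 matches the 17 bases in the SEQ column, while the alignment covers 16 bases of the reference.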
VCF (Variant Call Format) is a text file format for storing gene sequence variations. VCF files are primarily used in bioinformatics for representing SNP, indel, and structural variation data.
A VCF file includes meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.
##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##contig=<ID=20,length=63025520,assembly=B37>
#CHROM POS ID REF ALT QUAL FILTER INFO
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017
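Let's break this down. Lines beginning with "##" are meta-information lines; here they declare the format version, a filter, and a contig. The line beginning with "#CHROM" is the header line that names the columns. Each data line then describes one variant:
Column | Value in the first data line | Meaning |
---|---|---|
CHROM | 20 | Chromosome or contig name |
POS | 14370 | 1-based position of the variant |
ID | rs6054257 | Variant identifier, or "." if there is none |
REF | G | Reference allele |
ALT | A | Alternate allele or alleles, comma-separated |
QUAL | 29 | Phred-scaled quality of the call |
FILTER | PASS | PASS, or the names of the filters the call failed |
INFO | NS=3;DP=14;AF=0.5;DB;H2 | Semicolon-separated annotations, either key=value pairs or bare flags |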
Remember, VCF files can become quite complex, especially in the INFO column and when dealing with multiple samples (additional columns beyond INFO). This snippet is a minimal example to give you the basic idea.
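As a rough illustration of how the fixed columns and the INFO field fit together, here is a small Python sketch that parses one tab-delimited data line into a dictionary. The function names are hypothetical, all INFO values are kept as strings, and sample columns beyond INFO are ignored.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    parsed = {}
    if info == ".":
        return parsed
    for item in info.split(";"):
        if not item:
            continue
        key, has_value, value = item.partition("=")
        # Keys without '=' are flags, recorded here as True.
        parsed[key] = value if has_value else True
    return parsed


def parse_vcf_record(line):
    """Parse the eight fixed columns of a VCF data line."""
    chrom, pos, var_id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if var_id == "." else var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": parse_info(info),
    }


line = "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS", "NS=3;DP=14;AF=0.5;DB;H2"])
record = parse_vcf_record(line)
print(record["pos"], record["alt"], record["info"])
```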
GFF (General Feature Format) and GTF (Gene Transfer Format) are both file formats used for describing genes and other features of DNA, RNA, and protein sequences. Both consist of one tab-delimited line per feature, each containing nine columns: "seqname", "source", "feature", "start", "end", "score", "strand", "frame", and "attribute". GFF3, the current version of GFF, names the same nine columns seqid, source, type, start, end, score, strand, phase, and attributes. GTF is a variant based on version 2 of GFF and is commonly used to distribute gene annotations for genome assemblies.
##gff-version 3
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=mygene
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=mRNA1
ctg123 . exon 1050 1500 . + . ID=exon00001;Parent=mRNA00001
ctg123 . exon 3000 3902 . + . ID=exon00002;Parent=mRNA00001
ctg123 . three_prime_UTR 5000 9000 . + . ID=three_prime_UTR00001;Parent=mRNA00001
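Let's break down the columns, using the GFF3 column names and the first feature line:
Column | Field | Value in the example | Meaning |
---|---|---|---|
1 | seqid | ctg123 | Sequence (chromosome or contig) the feature lies on |
2 | source | . | Program or database that produced the feature; "." means unspecified |
3 | type | gene | Kind of feature (gene, mRNA, exon, and so on) |
4 | start | 1000 | Start position, 1-based and inclusive |
5 | end | 9000 | End position, inclusive |
6 | score | . | Score of the feature, or "." if there is none |
7 | strand | + | "+" for the forward strand, "-" for the reverse strand |
8 | phase | . | Reading-frame phase (0, 1, or 2) for CDS features; "." otherwise |
9 | attributes | ID=gene00001;Name=mygene | Semicolon-separated tag=value pairs; ID and Parent link features into a hierarchy |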
This GFF3 file describes a gene located on ctg123 from position 1000 to 9000 on the positive strand. The gene has an mRNA, with exons and a 3' UTR specified. Note that the example is highly simplified, and real GFF3 files can be much more complex, especially in the attributes (9th) field.
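The ID and Parent attributes are what turn a flat list of lines into a gene model. The Python sketch below parses the example, groups features by their parent, and sums the exon lengths of the mRNA. It is a simplified illustration: the helper names are ours, and it assumes each feature has at most one parent and that attribute values need no URL-decoding.

```python
from collections import defaultdict

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3(lines):
    """Return a list of feature dictionaries, skipping comments and directives."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, found {len(cols)}")
        feature = dict(zip(GFF3_COLUMNS, cols))
        feature["start"] = int(feature["start"])
        feature["end"] = int(feature["end"])
        feature["attributes"] = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if pair
        )
        features.append(feature)
    return features


rows = [
    ("ctg123", ".", "gene", "1000", "9000", ".", "+", ".", "ID=gene00001;Name=mygene"),
    ("ctg123", ".", "mRNA", "1050", "9000", ".", "+", ".", "ID=mRNA00001;Parent=gene00001;Name=mRNA1"),
    ("ctg123", ".", "exon", "1050", "1500", ".", "+", ".", "ID=exon00001;Parent=mRNA00001"),
    ("ctg123", ".", "exon", "3000", "3902", ".", "+", ".", "ID=exon00002;Parent=mRNA00001"),
    ("ctg123", ".", "three_prime_UTR", "5000", "9000", ".", "+", ".", "ID=three_prime_UTR00001;Parent=mRNA00001"),
]
features = parse_gff3(["##gff-version 3"] + ["\t".join(row) for row in rows])

# Group features under their parent ID (assumption: a single parent per feature).
children = defaultdict(list)
for feature in features:
    parent = feature["attributes"].get("Parent")
    if parent is not None:
        children[parent].append(feature)

# GFF coordinates are 1-based and inclusive, so length = end - start + 1.
exon_length = sum(
    f["end"] - f["start"] + 1 for f in children["mRNA00001"] if f["type"] == "exon"
)
print([f["type"] for f in children["mRNA00001"]], exon_length)
```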
The BED (Browser Extensible Data) format is a flexible, column-based format for defining data lines that are displayed in an annotation track. BED files are used in a variety of tasks, such as finding significant overlap between large datasets, and visualizing data in genome browsers.
BED files have three required fields - chromosome, start position, and end position, and nine additional optional fields. The optional fields allow for detailed information about the feature, such as its name, score, strand, etc.
chr1 1300 9000 feature1 0 +
chr1 1350 2000 feature2 0 -
chr2 3000 3902 feature3 0 +
chr2 5000 6000 feature4 0 -
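Here's what the columns represent:
Column | Field | Value in the first line | Meaning |
---|---|---|---|
1 | chrom | chr1 | Chromosome or scaffold name |
2 | chromStart | 1300 | Start position, 0-based |
3 | chromEnd | 9000 | End position, not included in the feature |
4 | name | feature1 | Label for the feature |
5 | score | 0 | Score between 0 and 1000 |
6 | strand | + | "+" or "-" |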
In its simplest form, a BED file requires only the first three fields. Additional fields can be added for more complex data. In addition, the use of track and browser lines can provide further customization for display in genome browsers.
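One detail that often trips people up is that BED coordinates are 0-based and half-open: the start is counted from zero and the end position is not part of the feature. The Python sketch below parses the example and shows how that convention makes length and overlap calculations simple. `BedInterval`, `parse_bed`, and `overlaps` are names chosen for this illustration, and only the first six columns are handled.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str = "."
    score: int = 0
    strand: str = "."


def parse_bed(lines):
    """Parse BED lines, skipping blank, comment, track, and browser lines."""
    intervals = []
    for line in lines:
        cols = line.split()
        if not cols or cols[0] in ("track", "browser") or cols[0].startswith("#"):
            continue
        if len(cols) < 3:
            raise ValueError("a BED line needs at least chrom, start, and end")
        intervals.append(BedInterval(
            cols[0], int(cols[1]), int(cols[2]),
            cols[3] if len(cols) > 3 else ".",
            int(cols[4]) if len(cols) > 4 else 0,
            cols[5] if len(cols) > 5 else ".",
        ))
    return intervals


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


intervals = parse_bed([
    "chr1 1300 9000 feature1 0 +",
    "chr1 1350 2000 feature2 0 -",
    "chr2 3000 3902 feature3 0 +",
    "chr2 5000 6000 feature4 0 -",
])
# With half-open coordinates the length is simply end - start.
lengths = [iv.end - iv.start for iv in intervals]
print(lengths, overlaps(intervals[0], intervals[1]))
```

Note the contrast with GFF, which uses 1-based inclusive coordinates; converting between the two formats means adding or subtracting one from the start position.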
BedGraph files are designed to represent continuous data along the genome, such as signal intensities or coverage levels. They provide a way to visualize and store numerical values at each position in the genome. A BedGraph file typically consists of four columns: chromosome, start position, end position, and value. The values in BedGraph files can be positive, negative, or zero, representing features like read coverage, gene expression levels, or ChIP-seq signal intensities.
track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr1 1000 2000 0.25
chr1 2000 3000 0.50
chr1 3000 4000 0.75
chr2 1000 2000 1.00
chr2 2000 3000 0.85
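Here's what the columns represent: the first column is the chromosome, the second is the start position (0-based), the third is the end position (not included in the interval), and the fourth is the numerical value assigned to every base in that interval.
The first line of a BedGraph file (beginning with the word track) is the track definition line, which provides configuration settings for displaying the track. It includes details such as the track type, name, description, visibility, color, and priority.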
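Since each BedGraph line assigns one value to a whole interval, summary statistics need to weight each value by the interval length. Here is a minimal Python sketch that computes the mean signal per chromosome over the covered bases; the function name `mean_signal` is hypothetical, and positions not listed in the file are simply left out of the average.

```python
from collections import defaultdict


def mean_signal(lines):
    """Return {chrom: length-weighted mean value} for BedGraph data lines."""
    weighted_sum = defaultdict(float)
    covered_bases = defaultdict(int)
    for line in lines:
        cols = line.split()
        # Skip blank lines, comments, and track/browser definition lines.
        if not cols or cols[0] in ("track", "browser") or cols[0].startswith("#"):
            continue
        if len(cols) != 4:
            raise ValueError("a BedGraph data line has exactly four columns")
        chrom, start, end, value = cols[0], int(cols[1]), int(cols[2]), float(cols[3])
        length = end - start  # half-open interval
        weighted_sum[chrom] += value * length
        covered_bases[chrom] += length
    return {chrom: weighted_sum[chrom] / covered_bases[chrom] for chrom in weighted_sum}


example = [
    'track type=bedGraph name="BedGraph Format" description="BedGraph format"',
    "chr1 1000 2000 0.25",
    "chr1 2000 3000 0.50",
    "chr1 3000 4000 0.75",
    "chr2 1000 2000 1.00",
    "chr2 2000 3000 0.85",
]
means = mean_signal(example)
print({chrom: round(value, 3) for chrom, value in means.items()})
```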
BigWig files are a binary file format commonly used in bioinformatics for efficient storage and retrieval of large-scale genomic data, such as signal intensities, coverage tracks, or other quantitative measurements. BigWig files are primarily designed for visualization and analysis in genome browsers and other genome data analysis tools.
BigWig files are compressed and indexed, allowing for fast random access to specific genomic regions. They provide an efficient representation of continuous numerical data across the genome and allow for zooming in and out of different genomic scales without the need to load the entire dataset into memory. BigWig files are compatible with popular genome browsers like UCSC Genome Browser and Integrative Genomics Viewer (IGV).
The Protein Data Bank (PDB) format is used to store three-dimensional data of proteins and nucleic acids. This format is widely used in the fields of molecular modeling, structural bioinformatics, protein design, drug discovery, and more. A PDB file consists of several sections providing different types of data, including information about atoms, connectivity, sequences, and crystallographic structure.
HEADER ALANINE 10-MAY-23
TITLE EXAMPLE STRUCTURE OF ALANINE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ALANINE;
COMPND 3 CHAIN: A;
COMPND 4 ENGINEERED: YES;
ATOM 1 N ALA A 1 -0.677 0.000 0.000 1.00 20.00 N
ATOM 2 CA ALA A 1 0.603 0.000 0.000 1.00 20.00 C
ATOM 3 C ALA A 1 1.273 1.212 0.000 1.00 20.00 C
ATOM 4 O ALA A 1 0.603 2.212 0.000 1.00 20.00 O
ATOM 5 CB ALA A 1 1.273 -0.788 1.212 1.00 20.00 C
ATOM 6 H ALA A 1 -1.193 -0.788 -0.515 1.00 20.00 H
ATOM 7 HA ALA A 1 0.603 -0.788 -0.515 1.00 20.00 H
ATOM 8 HB1 ALA A 1 0.603 -0.788 2.212 1.00 20.00 H
ATOM 9 HB2 ALA A 1 1.943 -1.576 1.212 1.00 20.00 H
ATOM 10 HB3 ALA A 1 1.943 0.000 1.727 1.00 20.00 H
END
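Let's break this down:
Record | Meaning |
---|---|
HEADER | Classification of the molecule and the deposition date |
TITLE | Title of the entry |
COMPND | Description of the molecular contents, here the molecule name, the chain, and whether it was engineered |
ATOM | One atom per line: serial number, atom name, residue name, chain identifier, residue number, x, y, and z coordinates in ångströms, occupancy, temperature factor, and element symbol |
END | Marks the end of the file |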
Please note that PDB files can become quite complex when dealing with large proteins, and may also contain additional information such as secondary structure, connectivity, and crystallographic data. This snippet is a minimal example to give you the basic idea.
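To show how the ATOM records can be used, the Python sketch below reads the coordinates from the example and computes the distance between two atoms. It splits each line on whitespace, which works for this small example; real PDB files are fixed-width and neighbouring fields can run together, so robust code should slice by column position or use a dedicated library. The function name and the choice to key atoms by name (valid only for a single residue) are assumptions made for illustration.

```python
import math


def parse_atoms(lines):
    """Return {atom_name: (x, y, z)} from whitespace-separated ATOM records."""
    atoms = {}
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "ATOM":
            continue
        # Assumption: 12 tokens per line, as in the example above.
        (_, serial, name, res_name, chain, res_seq,
         x, y, z, occupancy, temp_factor, element) = tokens
        atoms[name] = (float(x), float(y), float(z))
    return atoms


example = [
    "HEADER ALANINE 10-MAY-23",
    "ATOM 1 N ALA A 1 -0.677 0.000 0.000 1.00 20.00 N",
    "ATOM 2 CA ALA A 1 0.603 0.000 0.000 1.00 20.00 C",
    "ATOM 3 C ALA A 1 1.273 1.212 0.000 1.00 20.00 C",
    "END",
]
atoms = parse_atoms(example)
# Euclidean distance between the backbone nitrogen and the alpha carbon, in ångströms.
n_ca_distance = math.dist(atoms["N"], atoms["CA"])
print(round(n_ca_distance, 3))
```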
Understanding these file formats is a fundamental part of working in high-throughput research. Each file type has unique attributes and is used in different contexts depending on the type of analysis you are performing. Familiarity with these formats enables efficient handling, processing, and analysis of high-throughput data, paving the way for insightful biological discoveries. Remember, high-quality data analysis begins with understanding your data at its most basic level – its format. So next time you encounter a FASTQ, FASTA, BAM, or BAI file, you’ll know exactly what it contains and how best to use it in your research.