As a wet lab biologist working with next-generation sequencing (NGS) data, you’re likely dealing with large amounts of gene expression data. While identifying differentially expressed genes (DEGs) is a common first step, understanding the biological context of these genes can be challenging. This is where pathway analysis comes in, providing insights into how your genes are involved in specific biological processes, pathways, or molecular networks. In this post, we’ll explain two popular approaches for pathway analysis: Gene Set Enrichment Analysis (GSEA) and Over-Representation Analysis (ORA). We’ll walk you through what they are, how they work, and how to decide which one is right for your experiment.
What Is Pathway Analysis, and Why Should You Use It?
At its core, pathway analysis is a method for interpreting the biological significance of large-scale gene expression data by linking genes to known biological pathways. These pathways represent collections of genes that work together to perform specific biological functions. For instance, a pathway might describe genes involved in immune response, cell cycle regulation, or DNA repair.
By using pathway analysis, you can:
- Identify biological processes that are activated or suppressed in your experimental conditions.
- Gain insights into the molecular mechanisms underlying your experimental outcomes.
- Prioritize further research by pinpointing key biological pathways that are most likely involved in your study’s results.
But how do you find out which pathways are relevant to your data? That's where tools like GSEA and ORA come in!
Gene Set Enrichment Analysis (GSEA)
GSEA is an advanced and widely used method for pathway analysis, especially when you have a list of genes ranked by their expression changes between conditions (e.g., treated vs. untreated). Instead of looking only at individual genes, GSEA focuses on whether predefined sets of genes (called gene sets) from a particular biological pathway or process are enriched at the top or bottom of your ranked gene list.
How Does GSEA Work?
GSEA evaluates how the genes from a given biological pathway are distributed across a list of all genes ranked by their expression changes. It then uses that overall distribution, rather than any single gene, to assess whether the pathway tends to be more active or suppressed in your data.
- Rank the Genes: Genes are ranked by a signed measure of differential expression (for example, log2 fold change or a test statistic) that captures both how much and in which direction they change between your experimental conditions, like treated vs. untreated. The most upregulated genes are at the top, and the most downregulated genes are at the bottom.
- Evaluate Pathway Enrichment: GSEA checks if the genes associated with a particular pathway (e.g., immune response, cell cycle) are clustered together at either the top (highly upregulated) or the bottom (highly downregulated) of this ranked gene list.
- Calculate the Enrichment Score (ES): GSEA walks down the ranked list, keeping a running sum that increases when it meets a pathway gene and decreases otherwise. The enrichment score (ES) is the largest deviation of that running sum from zero, so it reflects how concentrated the pathway genes are at the top or bottom of the ranked list.
- Normalize the Enrichment Score (NES): The normalized enrichment score (NES) adjusts for differences in gene set size, so that results can be compared across gene sets.
  - A high positive NES indicates that the pathway is strongly upregulated. Genes in this pathway are mostly found at the top of the ranked gene list, suggesting activation of that pathway.
  - A high negative NES indicates that the pathway is strongly downregulated. The pathway’s genes are concentrated at the bottom of the ranked gene list, suggesting that the pathway is suppressed.
  - An NES close to zero suggests that the pathway isn’t enriched at either end of the list. Whether an enrichment is statistically significant is judged from the p-value or false discovery rate (FDR) reported alongside the NES.
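The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference GSEA implementation: the function names, the toy gene IDs and scores, and the choice to build the null distribution by shuffling gene set membership (as in a pre-ranked analysis) are all assumptions made for the example.

```python
import random

def running_enrichment(ranked_genes, scores, gene_set, weight=1.0):
    """Return (ES, running-sum curve) for one gene set.

    ranked_genes: gene IDs sorted from most upregulated to most downregulated.
    scores: the ranking metric for each gene, in the same order.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    # Hits step up in proportion to |score|**weight; misses step down by a constant.
    hit_total = sum(abs(s) ** weight for s, h in zip(scores, in_set) if h)
    if hit_total == 0:
        raise ValueError("all pathway genes have a ranking score of zero")
    running, curve = 0.0, []
    for s, h in zip(scores, in_set):
        running += abs(s) ** weight / hit_total if h else -1.0 / n_miss
        curve.append(running)
    # The ES is the point where the running sum is furthest from zero.
    return max(curve, key=abs), curve

def normalized_es(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Divide the ES by the mean null ES of the same sign (assumed null:
    random gene sets of the same size drawn from the ranked list)."""
    es, _ = running_enrichment(ranked_genes, scores, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        fake_set = set(rng.sample(ranked_genes, size))
        null.append(running_enrichment(ranked_genes, scores, fake_set)[0])
    same_sign = [abs(x) for x in null if (x >= 0) == (es >= 0)]
    if not same_sign:
        return float("nan")
    return es / (sum(same_sign) / len(same_sign))

# Toy data (hypothetical gene IDs and scores, for illustration only).
genes = [f"g{i}" for i in range(1, 11)]
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]
top_set = {"g1", "g2", "g3"}

es, curve = running_enrichment(genes, scores, top_set)
nes = normalized_es(genes, scores, top_set, n_perm=200)
print(f"ES = {es:.2f}, NES = {nes:.2f}")
```

Because the NES is the ES divided by the average null ES for gene sets of the same size, a small and a large pathway end up on a comparable scale. In practice you would rely on an established GSEA tool rather than a hand-rolled version like this one.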
How to Interpret the GSEA Enrichment Plot
The GSEA enrichment plot is a key visual tool that helps you interpret the NES and assess the significance of pathway enrichment. Here’s what you’ll see in a typical plot:
- X-axis: Shows your ranked list of genes (from most upregulated to most downregulated).
- Y-axis: Displays the running enrichment score (ES), which tells you how strongly the pathway’s genes are concentrated at the top or bottom of the ranked list.
- Green line: Represents the running sum of the ES as you move across the ranked gene list. The green line rises when pathway genes appear in the ranked list and decreases when they don’t.
For example, if you’re analyzing a pathway in treated cells compared to untreated control cells, and the green line peaks at the left (top of the ranked list), it means the pathway genes are concentrated among the most upregulated genes in the treated group, giving a positive NES. A negative NES shows up as the opposite pattern: the green line dips to its lowest point toward the right (bottom of the ranked list), suggesting that these genes are downregulated or suppressed in the treated cells.
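To make the reading of the plot concrete, here is a small sketch that takes a running-sum curve (the values traced by the green line) and reports where its extreme point falls. The function name and the two example curves are hypothetical and chosen only to mirror the two cases described above.

```python
def describe_peak(curve):
    """Locate the point where a running enrichment curve is furthest from zero."""
    idx = max(range(len(curve)), key=lambda i: abs(curve[i]))
    es = curve[idx]
    # 0.0 means the far left of the plot (top of the list), 1.0 the far right.
    position = idx / (len(curve) - 1) if len(curve) > 1 else 0.0
    direction = "upregulated" if es > 0 else "downregulated"
    return {"es": es, "rank": idx + 1, "relative_position": position,
            "direction": direction}

# Hypothetical running-sum values, one per gene in the ranked list.
rising_early = [0.4, 0.75, 1.0, 0.8, 0.6, 0.4, 0.2, 0.0]         # peak on the left
dipping_late = [-0.2, -0.4, -0.6, -0.8, -1.0, -0.6, -0.3, 0.0]   # trough further right

for name, curve in [("rising_early", rising_early), ("dipping_late", dipping_late)]:
    print(name, describe_peak(curve))
```

A positive extreme near the left edge corresponds to a pathway whose genes sit among the most upregulated genes, while a negative extreme further to the right corresponds to a pathway whose genes sit among the downregulated ones.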
In summary, GSEA allows you to assess whether entire biological pathways are activated or repressed, rather than just focusing on individual gene changes. It’s especially useful for interpreting large datasets where gene interactions and collective effects matter more than the behavior of individual genes.
Over-Representation Analysis (ORA)
While GSEA looks at the distribution of pathway genes in a ranked list, ORA offers a simpler approach. ORA tests whether specific gene sets (pathways) are over-represented among the DEGs identified in your dataset.
How Does ORA Work?
ORA compares the DEGs from your analysis with predefined gene sets (pathways) to determine if any of these gene sets are disproportionately represented compared to what would be expected by chance. Statistical tests are used to assess whether the overlap between your DEGs and pathway genes is significant.
- Identify the DEGs: From your analysis, identify the genes that show significant differential expression between your experimental conditions (e.g., genes that are upregulated or downregulated).
- Check for Pathway Over-Representation: ORA examines each predefined gene set (pathway) to see whether it contains more of your DEGs than would be expected from a random selection of the same number of genes.
- Perform Statistical Testing: ORA uses a statistical test (such as Fisher’s exact test or the equivalent hypergeometric test) to calculate a p-value: the probability of seeing an overlap between the DEGs and the pathway genes at least as large as the observed one by chance alone.
- Interpret the Results: A significant p-value (ideally after correcting for the number of pathways tested) indicates that the pathway is over-represented in your DEGs and may be biologically relevant. Conversely, a non-significant p-value suggests the pathway is not strongly implicated in your dataset.
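The statistical test in the steps above can be written out directly. The sketch below computes a one-sided hypergeometric p-value using only the Python standard library. The function names and the toy counts are assumptions for illustration, and the background is assumed to be the set of all genes measured in the experiment.

```python
from math import comb

def ora_pvalue(n_background, n_pathway, n_degs, n_overlap):
    """One-sided hypergeometric p-value: P(overlap >= n_overlap).

    n_background: number of genes that could have been called DEGs.
    n_pathway:    number of background genes annotated to the pathway.
    n_degs:       number of DEGs.
    n_overlap:    number of DEGs that are also in the pathway.
    """
    max_overlap = min(n_pathway, n_degs)
    total = comb(n_background, n_degs)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_degs - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

def ora(degs, pathway, background):
    """Convenience wrapper that works on sets of gene IDs."""
    background = set(background)
    degs = set(degs) & background
    pathway = set(pathway) & background
    return ora_pvalue(len(background), len(pathway), len(degs), len(degs & pathway))

# Toy example (hypothetical counts): 20 measured genes, a 5-gene pathway,
# 6 DEGs, 4 of which fall in the pathway.
p = ora_pvalue(n_background=20, n_pathway=5, n_degs=6, n_overlap=4)
print(f"p = {p:.4f}")
```

When many pathways are tested this way, the resulting p-values are normally adjusted for multiple testing before any of them is called significant.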
ORA is often faster and more straightforward than GSEA, but it assumes that the genes in each pathway are independent of one another, an assumption that may not hold in complex biological systems.
Should You Use GSEA or ORA?
Both GSEA and ORA have their strengths and weaknesses, and the choice between them depends on your dataset and research question.
GSEA is ideal if:
- You have a ranked gene list and are interested in pathway-level analysis of all genes, not just DEGs.
- You suspect that biological pathways are globally upregulated or downregulated across your dataset, even if not all individual genes in the pathway are differentially expressed.
- You want to focus on pathway enrichment, rather than just looking at specific gene lists.
ORA is ideal if:
- You have a list of differentially expressed genes (DEGs) and want to see if they overlap with specific pathways.
- You are working with a smaller dataset or just need a quicker, more straightforward analysis.
- You don’t need to focus on the distribution of genes across the entire list, but instead just need to know which pathways are over-represented in your DEGs.
Can You Use Both?
In some cases, using both approaches together can provide a more complete picture of your data. GSEA might give you insights into pathways that are activated or repressed as a whole, while ORA could highlight specific pathways that are over-represented in your DEGs. Together, they can offer a broader understanding of the biological processes at play in your experiment.
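Both analyses can start from the same differential expression results: GSEA takes every gene, ranked, while ORA takes only the genes that pass your DEG cutoffs. The sketch below shows one way to derive the two inputs. The gene names, values, and cutoffs are hypothetical, and ranking by log2 fold change is just one common choice of metric.

```python
# Hypothetical DE results: (gene, log2 fold change, adjusted p-value).
de_results = [
    ("GENE_A", 2.5, 0.001),
    ("GENE_B", -1.8, 0.004),
    ("GENE_C", 0.3, 0.60),
    ("GENE_D", -0.1, 0.90),
    ("GENE_E", 1.2, 0.03),
]

def gsea_input(results):
    """All genes, sorted from most upregulated to most downregulated."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return [(gene, lfc) for gene, lfc, _ in ranked]

def ora_input(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """The DEG list (genes passing both cutoffs) plus the background gene set."""
    degs = {
        gene for gene, lfc, padj in results
        if abs(lfc) >= lfc_cutoff and padj <= padj_cutoff
    }
    background = {gene for gene, _, _ in results}
    return degs, background

ranked_list = gsea_input(de_results)
degs, background = ora_input(de_results)
print("GSEA ranked list:", ranked_list)
print("ORA DEGs:", sorted(degs), "out of", len(background), "background genes")
```

Note that the ORA result depends on the cutoffs you pick, whereas the GSEA input keeps every gene, which is one reason the two methods can highlight different pathways.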
Conclusion
Pathway analysis is a valuable tool for making sense of complex gene expression data by identifying key biological processes or pathways that are impacted by your experimental conditions. Both GSEA and ORA are effective approaches, with GSEA providing a more comprehensive view of pathway enrichment across your entire gene list, and ORA focusing on the over-representation of pathways among your DEGs. By understanding the strengths of each method, you can choose the one that best fits your research needs, or combine both for deeper insights.