Automating RNA-Seq Downstream Analysis with DESeq2 and ClusterProfiler
In the realm of transcriptomics, RNA-Seq has become a powerful tool for understanding gene expression patterns. However, the journey from raw data to meaningful biological insights involves several intricate steps of data processing and analysis. In this tutorial, we will walk through a comprehensive RNA-Seq downstream analysis workflow using DESeq2 and ClusterProfiler, split into distinct stages for clarity and automation.
Prerequisites
- Basic knowledge of RNA-Seq data and its analysis
- Familiarity with the R programming language
- Proficiency in using the command-line interface
- Understanding of biological concepts related to gene expression and enrichment analysis
Tools and Packages
- DESeq2: A popular R package for differential gene expression analysis
- ClusterProfiler: An R package for functional enrichment analysis of gene clusters
- Python: For data preprocessing and manipulation
- Bash: For automation and workflow orchestration
In a previous tutorial, we covered upstream RNASeq processing involving read quality control using FastQC and MultiQc, read alignment using Hisat2, and read quantification using featureCounts from the subread package. The result is a table containing gene expression data. In this tutorial, we will perform statistical analysis and functional enrichment analysis to find statistically significant genes that are differentially expressed and which pathways are enriched by those genes. If you have not seen the previous tutorial on upstream analysis, here is the link.
About Data
Gene expression data can be largely found in the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo). In this database, there are both raw reads (in fastq formats) and processed gene expression data. You can download any processed gene expression data that has raw counts. But I will be using the results obtained from the upstream analysis I performed in a previous tutorial.
Just as a recap, the dataset used is the “RNAseq in Alzheimer’s Disease patients” with accession GSE53697. You can choose to click on the link and download that raw data as it will offer you more challenges in the preprocessing part of this workflow. Or you can choose to go with this data (raw count data, metadata) obtained from the previous tutorial.
Please note that the counts in this dataset do not reflect real data as only a small fraction of the reads were sampled before the upstream workflow. So the results from this analysis are not meant for any sort of scientific reference.
Workflow Overview
- Preprocessing: Clean and match count data with metadata
- DESeq2 Analysis: Perform differential gene expression analysis
- Functional Enrichment Analysis: Identify enriched gene sets and pathways
Let’s dive into each stage of the workflow:
1. Preprocessing
The metadata is prepared manually before the upstream analysis, and the reads are renamed to have a certain naming pattern that matches the sample names in the metadata. This must be done manually as several datasets have differences in their data organization. Once metadata and reads are well prepared, the rest of the processes can be run automated.
For the preprocessing step, we’ll write a Python script to clean and match count data with metadata. This script will perform tasks such as renaming columns, removing duplicates, and filtering genes with zero counts. Let’s create a Python script for this task:
#!/usr/bin/env python
import argparse
import pandas as pd
def process_count_data(metadata_path, counts_data_path, output_path):
# load count data and metadata
metadata_df = pd.read_csv(metadata_path)
count_data_df = pd.read_csv(counts_data_path, sep="\t")
# rename columns
count_data_df.columns = ["geneSymbol"] + [name.split("/")[-1].split(".")[0] for name in count_data_df.columns[1:]]
# Assume 'sampleAccession' column in metadata contains the sample names in the desired order
desired_order = metadata_df['sampleAccession'].tolist()
# Reorder the columns of count data DataFrame according to the desired order
count_data_df = count_data_df.reindex(columns=["geneSymbol"]+desired_order)
# Remove duplicates based on GeneSymbol column
count_data_df = count_data_df.drop_duplicates(keep='first')
count_data_df = count_data_df.drop_duplicates(subset='geneSymbol', keep='first')
# Remove missing values
count_data_df = count_data_df.dropna().reset_index(drop=True)
# Remove genes with zero values across all samples
count_data_df = count_data_df.loc[:, (count_data_df != 0).any(axis=0)]
# Remove genes with negative values
count_data_df = count_data_df.loc[(count_data_df.iloc[:,1:] >= 0).all(axis=1), :]
# save count data
count_data_df.to_csv(output_path, index=False)
def main():
parser = argparse.ArgumentParser(description="Process count data and metadata")
parser.add_argument("metadata", help="Path to the metadata CSV file")
parser.add_argument("counts_data", help="Path to the counts data TXT file")
parser.add_argument("output", help="Path to save the processed count data CSV file")
args = parser.parse_args()
process_count_data(args.metadata, args.counts_data, args.output)
if __name__ == "__main__":
main()
This script takes three arguments: metadata file path, counts data file path, and output file path. It processes the count data and metadata, removes duplicates and genes with zero values across all samples, and saves the processed count data.
Now, let’s automate this preprocessing step in a bash script:
#!/bin/bash
echo "Starting preprocessing…"
python post_upstream_processing.py data/metadata.csv data/counts_data.txt data/processed_count_data.csv
echo "Preprocessing completed."
This bash script executes the Python preprocessing script and provides necessary file paths as arguments.
2. DESeq2 Analysis
DESeq2 is a powerful R package widely used for differential gene expression analysis in RNA-Seq data. It employs robust statistical methods to identify genes that are differentially expressed between conditions, accounting for variability and sample-specific effects. The workflow typically involves normalization, estimation of dispersion, statistical testing, and visualization of results.
For DESeq2 analysis, we’ll write an R script that performs the analysis and generates relevant plots. This script will take count data and metadata as inputs and output differential gene expression results.
#!/usr/bin/env Rscript
# differential_gene_expression_analysis.R
require(optparse)
#Parse arguments from command line
options <- list(
make_option(c(" - input_count_data_path"), action = "store", type = "character", help="Path to the raw count data."),
make_option(c(" - metadata_path"), action = "store", type = "character", help="Path to the metadata."),
make_option(c(" - outdir"), action = "store", default = "data", type = "character", help="Output directory for storing differential gene expression analysis results."),
make_option(c(" - factor_level1"), action = "store", default = "Type", help="Class level 1 column name."),
make_option(c(" - factor_level2"), action = "store", default = "Type", help="Class level 2 column name.")
)
arguments <- parse_args(OptionParser(option_list = options))
# Check if mandatory arguments are provided
if (is.null(arguments$input_count_data_path) || is.null(arguments$metadata_path)) {
stop("Error: Mandatory arguments - input_count_data_path and - metadata_path are required.")
}
# create variable handles for parsed parameters
input_count_data_path <- arguments$input_count_data_path
metadata_path <- arguments$metadata_path
outdir <- arguments$outdir
factor_level1 <- arguments$factor_level1
factor_level2 <- arguments$factor_level2
# Load DESeq2 library
library(DESeq2);
library(dplyr);
library(pheatmap)
library(ggplot2)
library("RColorBrewer")
library(ggrepel)
# create output directory
outdir <- paste0(outdir, "/dge")
dir.create(outdir)
# Load count data
count_data <- read.table(input_count_data_path, header = TRUE, row.names = 1, sep = ",")
colnames(count_data)
# load metadata
metadata <- read.table(metadata_path, header = TRUE, row.names = 1, sep = ",")
all(rownames(metadata) %in% colnames(count_data))
all(rownames(metadata) == colnames(count_data))
# set factor levels
factors <- factor(metadata[[factor_level1]])
metadata[[factor_level1]] <- factors
# Create a deseq object and import the count data and metatdata
dds <- DESeqDataSetFromMatrix(countData = round(count_data),
colData = metadata,
design = ~ Type)
# set reference for group factors
dds[[factor_level1]] <- relevel(dds[[factor_level1]], ref = "healthy control")
# filter out low count genes
# keep genes with at least N counts >= 10, where N = size of smallest group
keep <- rowSums(counts(dds) >=10) >= min(table(metadata[[factor_level1]]))
dds <- dds[keep,]
# Perform Statistical Test using Deseq
dds <- DESeq(dds, test = "Wald")
# get results of Deseq
deseq_result <- results(dds, cooksCutoff = FALSE, independentFiltering = FALSE)
# process results
deseq_result_df <- as.data.frame(deseq_result)
deseq_result_df <- cbind(GeneName = row.names(deseq_result_df), deseq_result_df)
# save results
write.table(deseq_result_df, file = paste0(outdir, "/Deseq2-results-all.tsv"), row.names = F, sep="\t")
# extract genes with padj < 0.05 and log2Foldchange <= -1 or >= 1
deg <- subset(deseq_result, padj<0.05 & abs(log2FoldChange) >=1)
# sort by adjusted p-values
deg <- deg[order(deg$padj),]
# save filtered results
write.table(deg, file = paste0(outdir, "/Deseq2-results-filtered.tsv"), row.names = F, sep = "\t")
#########################################
# Gene expression data visualization
#########################################
################################################################################
# Start MA Plot
resLFC <- lfcShrink(dds, coef=resultsNames(dds)[2], type="apeglm") # remove noise
# Save the plot as a PDF/PNG file
pdf(paste0(outdir, "/MA_plot.pdf"), width = 8, height = 6)
plotMA(deseq_result, ylim = c(-2, 2))
dev.off()
png(paste0(outdir, "/MA_plot.png"), width = 480, height = 360)
plotMA(deseq_result, ylim = c(-2, 2))
dev.off()
pdf(paste0(outdir, "/MA_plot_denoised.pdf"), width = 8, height = 6)
plotMA(resLFC, ylim=c(-2,2))
dev.off()
png(paste0(outdir, "/MA_plot_denoised.png"), width = 480, height = 360)
plotMA(resLFC, ylim=c(-2,2))
dev.off()
# End
################################################################################
################################################################################
# Start Dispersion plot
pdf(paste0(outdir, "/Dispersion_plot.pdf"), width = 8, height = 6)
plotDispEsts(dds)
dev.off()
png(paste0(outdir, "/Dispersion_plot.png"), width = 480, height = 360)
plotDispEsts(dds)
dev.off()
# End
################################################################################
################################################################################
# Start volcano plots # set groups
old.pal <- palette(c("#FF0000", "#0000FF", "#808080")) # set colors
par(mar=c(4,4,2,1), cex.main=1.5) # set margin size
pdf(paste0(outdir, "/Volcano_plot.pdf"), width = 8, height = 6)
#plot values
plot(deseq_result$log2FoldChange, -log10(deseq_result$padj), col = "darkgray",
xlab="log2FoldChange", ylab="-log10(Padj)", pch=20, cex=0.5)
with( subset(deseq_result, padj <0.05 & abs(log2FoldChange) >=1),
points(log2FoldChange, -log10(padj), pch =20, col=(sign(log2FoldChange) +3)/2, cex =1))
legend("bottomleft", title=paste("Padj<", 0.05,sep=""),
legend=c("down regulated", "up regulated", "unchanged"), pch=20, col=1:3)
dev.off()
resLFC_df <- as.data.frame(deseq_result_df)
resLFC_df$diffexpressed <- "no change"
resLFC_df$diffexpressed[resLFC_df$log2FoldChange>1 & resLFC_df$padj<0.05] <- "up regulated"
resLFC_df$diffexpressed[resLFC_df$log2FoldChange< (-1) & resLFC_df$padj<0.05] <- "down regulated"
resLFC_df$diffexpressed[resLFC_df$padj>=0.05] <- "insignificant"
resLFC_df$delabel <- NA
volcano2 <- ggplot(data = resLFC_df, aes(x=log2FoldChange, y=-log10(padj), col=diffexpressed,label=GeneName))+
geom_point()+
theme_minimal()+
geom_text_repel()+
scale_color_manual(values=c("red","black","darkgray","blue"))+
theme(text=element_text(size=10))
ggsave(filename = paste0(outdir, "/Volcano_plot2.png"), plot = volcano2 , bg = "white",device = "png", width = 8, height = 6)
ggsave(filename = paste0(outdir, "/Volcano_plot2.pdf"), plot = volcano2 , device = "pdf", width = 8, height = 6)
# End
#############################################################################
This script reads the processed count data and metadata, performs DESeq2 analysis, and generates relevant plots.
3. Functional Enrichment Analysis
In the subsequent phase of the analysis, another R script is crafted to execute functional enrichment analysis employing ClusterProfiler. This script is pivotal in unveiling biological insights embedded within the identified differentially expressed genes (DEGs). The first step entails inputting the list of DEGs obtained from the previous DESeq2 analysis, serving as the foundation for gene set enrichment analysis (GSEA). Subsequently, enriched gene sets are scrutinized through diverse visualization techniques, including bar charts and dot plots. These plots offer a comprehensive overview of the distribution and significance of enriched gene sets, facilitating the identification of key biological pathways.
Moreover, the analysis delves deeper to pinpoint significantly enriched pathways, such as those elucidated by the Kyoto Encyclopedia of Genes and Genomes (KEGG). These pathways represent interconnected networks of genes and biochemical reactions, shedding light on the underlying biological processes at play. Finally, the enriched pathways are interpreted within the context of the experimental conditions, unraveling their biological significance and implications. This interpretative step is crucial for elucidating the molecular mechanisms driving observed gene expression changes and discerning their relevance in the broader biological context of the study.
Automating Functional Enrichment Analysis
For the functional enrichment analysis using ClusterProfiler, we will write an R script that takes the results from the DESeq2 analysis and performs gene set enrichment and KEGG pathway analysis. Here’s an example R script for this task:
#!/usr/bin/env Rscript
# gene_set_enrichment_analysis.R
require(optparse)
#Parse arguments from command line
options <- list(
make_option(c("--deseq2_results_path"), action = "store", type = "character", help="Path to the input data. Output from Deseq2"),
make_option(c("--outdir"), action = "store", default = "data", type = "character", help="Output directory."),
make_option(c("--geneNamingTYpe"), action = "store", default = "ALIAS", help="Which gene naming convention is used. e.g ALIAS, ENSEMBL, etc"),
make_option(c("--kegg_organism_db"), action = "store", default = "org.Hs.eg.db", help="Kegg organism database."),
make_option(c("--kegg_organism"), action = "store", default = "hsa", help="Kegg code for given organism.")
)
arguments <- parse_args(OptionParser(option_list = options))
# Check if mandatory arguments are provided
if (is.null(arguments$deseq2_results_path)) {
stop("Error: Mandatory arguments --deseq2_results_path is required.")
}
# create variable handles for parsed parameters
deseq2_results_path <- arguments$deseq2_results_path
outdir <- arguments$outdir
geneNamingTYpe <- arguments$geneNamingTYpe
kegg_organism <- arguments$kegg_organism
kegg_organism_db <- arguments$kegg_organism_db
# load required packages
library(ggplot2)
library(clusterProfiler)
library(enrichplot)
library(kegg_organism_db, character.only = TRUE)
library('cowplot')
# # set working directory
# setwd(dirname(dirname(rstudioapi::getActiveDocumentContext()$path)))
# create output directories
dir.create(paste0(outdir, "/gsea"))
dir.create(paste0(outdir, "/kegg"))
#############################
# Load and Prepare Input Data
#############################
# reading in data from deseq2
df = read.csv(deseq2_results_path, header=TRUE, sep = "\t")
# we want the log2 fold change
original_gene_list <- df$log2FoldChange
# name the vector
names(original_gene_list) <- df$GeneName
# omit any NA values
gene_list<-na.omit(original_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)
#############################
# Gene Set Enrichment
#############################
# perform gene set enrichment analysis
gse <- gseGO(geneList=gene_list,
ont ="ALL",
keyType = geneNamingTYpe,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
verbose = TRUE,
OrgDb = kegg_organism_db,
pAdjustMethod = "none")
# save enrichment results
write.table(gse, file = paste0(outdir,"/gsea/gene-set-enrichment-results.tsv"), row.names = F, sep="\t")
###########################################
# Gene Set Enrichment Results Visualization
###########################################
# import the dose package
require(DOSE)
# Dot plot of enriched terms
dotplot1 <- dotplot(gse, showCategory=20, split=".sign", font.size =7, color = "p.adjust",label_format = 100)
ggsave(filename = paste0(outdir, "/gsea/gseGO-dotplot-enriched-terms.png"), plot = dotplot1 , bg = "white",device = "png", width = 8, height = 6)
ggsave(filename = paste0(outdir, "/gsea/gseGO-dotplot-enriched-terms.pdf"), plot = dotplot1 , device = "pdf", width = 8, height = 6)
################################################################################
# KEGG Gene Set Enrichment Analysis
################################################################################
# Convert gene IDs for gseKEGG function. We will lose some genes here because not all IDs will be converted
ids<-bitr(names(original_gene_list), fromType = geneNamingTYpe, toType = "ENTREZID", OrgDb=kegg_organism_db)
# remove duplicate IDS
dedup_ids = ids[!duplicated(ids[c(geneNamingTYpe)]),]
# Create a new dataframe df2 which has only the genes which were successfully mapped using the bitr function above
df2 = df[df$GeneName %in% dedup_ids[[geneNamingTYpe]],]
# Create a new column in df2 with the corresponding ENTREZ IDs
df2$Y = dedup_ids$ENTREZID
# Create a vector of the gene unuiverse
kegg_gene_list <- df2$log2FoldChange
# Name vector with ENTREZ ids
names(kegg_gene_list) <- df2$Y
# omit any NA values
kegg_gene_list<-na.omit(kegg_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)
# create a gseKEGG object
kk2 <- gseKEGG(geneList = kegg_gene_list,
organism = kegg_organism,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
pAdjustMethod = "none",
keyType = "ncbi-geneid")
# save kegg pathway analysis results
write.table(kk2, file = paste0(outdir,"/kegg/kegg-analysis-results.tsv"), row.names = F, sep="\t")
# Dot plot of enriched terms kegg
dotplot1 <- dotplot(kk2, showCategory=20, split=".sign", font.size =7, label_format = 100)
ggsave(filename = paste0(outdir, "/kegg/gseKEGG-dotplot-enriched-terms.png"), plot = dotplot1 , bg = "white",device = "png", width = 8, height = 6)
ggsave(filename = paste0(outdir, "/kegg/gseKEGG-dotplot-enriched-terms.pdf"), plot = dotplot1 , device = "pdf", width = 8, height = 6)
This script reads the statistical test results from DESeq2, performs gene set enrichment analysis, and generates relevant plots.
Explanation of the Automation Script
#!/bin/bash
# Display message indicating post-upstream processing is starting
echo "Starting post-upstream processing..."
# Run post-upstream processing Python script
python scripts/post-upstream-processing.py data/metadata.csv data/counts_data.txt data/count_data.csv
# Display message indicating post-upstream processing is done
echo "Post-upstream processing done."
# Display message indicating differential gene expression analysis is starting
echo "Starting differential gene expression analysis..."
# Run differential gene expression analysis script
Rscript scripts/differential-gene-expression-analysis.R --input_count_data_path "data/count_data.csv" --metadata_path "data/metadata.csv"
# Display message indicating differential gene expression analysis is done
echo "Differential gene expression analysis done."
# Display message indicating gene set enrichment analysis is starting
echo "Starting gene set enrichment analysis..."
# Run gene set enrichment analysis script
Rscript scripts/gene-set-enrichment-analysis.R --deseq2_results_path data/dge/Deseq2-results-all.tsv --geneNamingTYpe "ENSEMBL"
# Display message indicating gene set enrichment analysis is done
echo "Gene set enrichment analysis done."
- Post-Upstream Processing: The script starts by calling the Python script for post-upstream processing of the count data. It takes metadata and raw count data as inputs and generates processed count data.
- Differential Gene Expression Analysis: Next, the script runs the R script for differential gene expression analysis using DESeq2. It takes the processed count data and metadata as inputs and performs DESeq2 analysis to identify differentially expressed genes.
- Gene Set Enrichment Analysis: Finally, the script executes the R script for gene set enrichment analysis using ClusterProfiler. It takes the DESeq2 results as input and performs functional enrichment analysis to identify enriched gene sets and pathways.
Conclusion
In this tutorial, we have demonstrated a step-by-step RNA-Seq downstream analysis workflow using DESeq2 and ClusterProfiler, with a focus on automation. By following this guide, researchers can efficiently analyze their RNA-Seq data, identify differentially expressed genes, and gain insights into the underlying biological processes. Automating the workflow enhances reproducibility and scalability, enabling researchers to focus on the interpretation of results and biological discoveries.
Stay tuned as next, we will explore how to automate a complete end-to-end workflow with Nextflow. Also, keep in touch as in the future I will be working on several case studies where we will examine the results of each of these analyses in detail.
Access full code for this tutorial here.
References
- DESeq2 tutorial: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
- ClusterProfiler tutorial: https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/