Quality Control of High-Volume Sequencing Data with FastQC: A Complete Guide

High-throughput sequencing technologies have revolutionized genomics by generating massive volumes of data in a short time. At that scale, checking the quality of the data before any analysis is crucial. This is where FastQC comes in.

FastQC analyzes a set of sequence files and generates a quality control report for each, comprising various modules to identify potential data issues. If no files are specified, it launches as an interactive graphical application; otherwise, it runs in non-interactive mode, suitable for standardized analysis pipelines.

FastQC, a Java-based application, offers modular analyses to swiftly highlight potential problems in sequence data. Its key functions include:

  1. Importing data from BAM, SAM, or FastQ files (in any variant).
  2. Providing an overview to detect potential issues.
  3. Generating summary graphs and tables for rapid data assessment.
  4. Exporting results to an HTML-based permanent report.
  5. Supporting offline operation, so reports can be generated automatically without running the interactive application.

FastQC requires a compatible Java Runtime Environment (JRE) to run. It uses the Picard BAM/SAM Libraries, which are included in the download package. FastQC is released under the GPL v3 or later license. Feedback for further improvements is encouraged, with Simon Andrews serving as the initial contact.

Installation

To begin using FastQC, ensure you have a suitable JRE installed. The tool includes the Picard BAM/SAM Libraries, eliminating the need for separate installations.

JRE Installation

Download and install a suitable 64-bit JRE, ensuring that the Java application is in your path. On Linux, you can install Java using package managers such as apt or yum.

Ubuntu / Mint:

sudo apt install default-jre

CentOS / Red Hat:

sudo yum install java-1.8.0-openjdk

You can check whether Java is installed by opening the ‘cmd’ program on Windows, or any shell on Linux, and typing java -version.

You should see something like:

>java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.2+9, mixed mode)
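
The same check can be scripted, for example at the start of a pipeline. The following is a minimal sketch rather than anything shipped with FastQC; the function name check_java is hypothetical, and it only tests that a java command is reachable on the PATH, not that its version suits FastQC.

```shell
#!/bin/sh
# Hypothetical helper: report whether a Java runtime is reachable on the PATH.
check_java() {
    if command -v java >/dev/null 2>&1; then
        # java -version writes to stderr, so merge it into stdout
        # and keep only the first line (the version string).
        java -version 2>&1 | head -n 1
        return 0
    fi
    echo "java not found on PATH; install a JRE first" >&2
    return 1
}
```

Calling check_java prints the version line and returns 0 when Java is available; otherwise it prints a message to stderr and returns 1, which a calling script can use to stop early.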

FastQC Installation

You can download FastQC directly from the Babraham Bioinformatics website. The installation steps are simple:

  1. Choose the appropriate version according to your operating system.
  2. Extract the downloaded file to any location you prefer.
  3. If you’re using Linux, you may need to make the fastqc file executable by running chmod +x fastqc in your terminal.
  4. Finally, launch FastQC by running the fastqc script located in the unzipped directory (on Windows, use the run_fastqc.bat file instead).
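
On Linux, steps 3 and 4 can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name prepare_fastqc is hypothetical, and it expects the path of the directory you extracted in step 2 as its argument.

```shell
#!/bin/sh
# Hypothetical helper: make the fastqc wrapper in an extracted FastQC
# directory executable and print its full path.
# $1 is the directory that contains the fastqc file.
prepare_fastqc() {
    dir=$1
    if [ ! -f "$dir/fastqc" ]; then
        echo "no fastqc file found in $dir" >&2
        return 1
    fi
    chmod +x "$dir/fastqc"
    printf '%s\n' "$dir/fastqc"
}
```

For example, if the archive was extracted under a tools directory in your home folder (an illustrative location), prepare_fastqc "$HOME/tools/FastQC" would print the path of the now-executable wrapper, which can then be run directly or added to the PATH.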

Quick Start

Once FastQC is installed, conducting a basic quality assessment on your sequence data is straightforward. Here’s a step-by-step guide:

  1. Open a terminal window (or command prompt for Windows users).
  2. Navigate to the directory where your FastQ files are located.
  3. Execute the command fastqc data.fastq to initiate the analysis. (Note: SAM/BAM files can also be used).

FastQC will process the file and produce an HTML report (data_fastqc.html), along with a compressed archive (data_fastqc.zip) containing the report and related files. You can open the HTML report in any web browser to examine the results.
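
Among the related files in the archive is a plain-text summary.txt, with one tab-separated line per module giving its status (PASS, WARN, or FAIL), the module name, and the input file name. In an automated setting it can be handy to read that file instead of opening the HTML report. The sketch below assumes that layout; the function name failed_modules is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the names of modules flagged FAIL in a
# FastQC summary.txt file.
# Each line is expected to be: STATUS <TAB> MODULE NAME <TAB> INPUT FILE.
failed_modules() {
    awk -F '\t' '$1 == "FAIL" { print $2 }' "$1"
}
```

After extracting the archive, failed_modules data_fastqc/summary.txt lists one failing module per line, and prints nothing when no module failed.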

Below are examples of commonly used command-line options:

  1. Analyzing Multiple Files: To analyze several FastQ files in one run, list them after the fastqc command, separated by spaces. For instance: fastqc file1.fastq file2.fastq file3.fastq.
  2. Specifying Output Directory: By default, output files are written next to each input file. To write them elsewhere, use the -o (or --outdir) option followed by the directory path. The directory must already exist; FastQC will not create it. Example: fastqc -o /path/to/output/ data.fastq.
  3. Controlling Archive Extraction: FastQC writes the HTML report and a zipped archive containing the report and data files. The --extract option additionally unzips that archive after it has been created, while --noextract leaves it compressed. Neither option skips creation of the ZIP file. For instance: fastqc --noextract data.fastq.
  4. Adjusting the Number of Threads: FastQC can process multiple files concurrently, one file per thread. Use the -t (or --threads) option followed by the desired thread count. Because each file is handled by a single thread, extra threads help only when there are multiple input files. Example: fastqc -t 4 file1.fastq file2.fastq.
  5. Disabling Base Grouping: For reads longer than 50 bp, FastQC groups adjacent bases in its per-position plots to keep them readable. The --nogroup option disables this grouping so that every position is reported individually, which can produce very large plots for long reads. This option is unrelated to non-interactive mode, which FastQC enters automatically whenever files are given on the command line. Example: fastqc --nogroup data.fastq.
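
In a pipeline, several of these options are typically combined. The sketch below is one way to do that, not an official wrapper: the function name run_fastqc and its argument order are hypothetical, it uses the lowercase -o form listed by fastqc -h, and it creates the output directory first because FastQC expects that directory to exist.

```shell
#!/bin/sh
# Hypothetical wrapper around fastqc for use in a pipeline.
# Usage: run_fastqc OUTPUT_DIR THREADS FILE [FILE...]
run_fastqc() {
    outdir=$1
    threads=$2
    shift 2
    # Create the output directory up front; FastQC will not create it.
    mkdir -p "$outdir" || return 1
    # Passing files on the command line makes FastQC run non-interactively.
    # --noextract keeps the result archive zipped.
    fastqc -o "$outdir" -t "$threads" --noextract "$@"
}
```

For instance, run_fastqc qc_reports 2 file1.fastq file2.fastq would process both files in parallel and place the reports in a qc_reports directory (an illustrative name). The function returns fastqc's exit status, so the calling script can check whether the run succeeded.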

For a complete list of commands and options, refer to the official documentation or run fastqc -h in your terminal.

FastQC is an essential tool for anyone working with high-throughput sequencing data. By quickly flagging quality problems, it helps keep downstream analyses reliable, making it valuable for seasoned bioinformaticians and beginners alike.

For more details on interpreting FastQC results, visit the Quality Control and Trimming in Genomic Analysis guide by Simon Andrews.

Reference

Andrews, S. (Year). FastQC: A Quality Control Tool for High Throughput Sequence Data [Software]. Available from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/