NGS Data Trimming and Filtering with Trimmomatic
Trimmomatic is a flexible preprocessing tool for trimming and filtering Illumina next-generation sequencing (NGS) data. It handles paired-end data correctly and performs well across a range of scenarios. It was developed by Anthony M. Bolger, Marc Lohse, and Bjoern Usadel to meet the demand for a tool that combines flexibility, correct paired-end handling, and robust performance.
Key Features
Trimmomatic boasts several key features:
- Adapter Removal: It effectively detects and eliminates adapter sequences from reads, a crucial step for ensuring the accuracy of subsequent analyses.
- Quality Filtering: Trimmomatic provides two primary methods for quality filtering, utilizing Illumina quality scores to determine optimal trimming points.
- Sliding Window Quality Filtering: This method scans from the 5' end of the read and trims the 3' end when the average quality drops below a specified threshold.
- Maximum Information Quality Filtering: An adaptive alternative that weighs the benefit of keeping a longer read against the cost of retaining low-quality bases when choosing the trimming point.
Trimmomatic operates under the GPL V3 license and is cross-platform, requiring Java 1.5 or higher. It can be downloaded from the Usadel Lab website.
Installation
Installing Trimmomatic is straightforward:
- Download the latest version from the official website.
- Unzip the downloaded file to a directory of choice.
- Ensure a Java Runtime Environment (JRE) is installed and accessible from the command line.
Once installed, Trimmomatic can be invoked from the command line using the `java -jar` command followed by the path to the Trimmomatic jar file and appropriate options.
JRE Installation
Download and install a suitable 64-bit JRE, ensuring that the Java application is in your path. On Linux, you can install Java using package managers such as apt or yum.
Ubuntu / Mint:
sudo apt install default-jre
CentOS / Redhat:
sudo yum install java-1.8.0-openjdk
You can check whether Java is installed by opening the ‘cmd’ program on Windows, or any shell on Linux, and typing java -version.
You should see something like:
>java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.2+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.2+9, mixed mode)
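If you want to check the Java version from a script rather than by eye, the version line shown above can be parsed. The following is a minimal sketch; the function name `java_major_version` is hypothetical, and it assumes the version string appears in double quotes after the word "version", as in the sample output.

```python
import re

def java_major_version(version_output: str) -> int:
    """Extract the major Java version from text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) else 0
    # Legacy numbering: "1.8.0_292" means Java 8.
    # Modern numbering: "11.0.2" means Java 11.
    return second if first == 1 else first

sample = 'openjdk version "11.0.2" 2019-01-15'
print(java_major_version(sample))
```

Note that `java -version` writes to standard error, so when capturing it with `subprocess.run(["java", "-version"], capture_output=True, text=True)` the text to parse is in the `stderr` attribute.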
Quick Start
To begin using Trimmomatic:
- Prepare input files, typically in FASTQ format.
- Execute Trimmomatic using a command like the example below, which specifies the input and output files, sets the quality score encoding, and lists the trimming steps to apply.
java -jar trimmomatic-0.39.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
This command names the input and output files for both forward and reverse reads and, with -phred33, tells Trimmomatic that the quality scores use the Phred+33 encoding. The paired-end mode writes four outputs: for each direction, one file for reads whose mate also survived (paired) and one for reads whose mate was discarded (unpaired). The trimming steps that follow are adapter removal, removal of low-quality bases at the beginning and end of reads, sliding window trimming, and a minimum length filter.
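When the same command has to be run for many samples, it can help to assemble the argument list in a script. The sketch below builds the paired-end command shown above. The function name `build_pe_command` and the `out_prefix` naming scheme are assumptions made for illustration, not part of Trimmomatic.

```python
def build_pe_command(jar, fwd_in, rev_in, out_prefix, steps, phred="-phred33"):
    """Return the argument list for a paired-end Trimmomatic run."""
    # Output order expected in PE mode: forward paired, forward unpaired,
    # reverse paired, reverse unpaired.
    outputs = [
        f"{out_prefix}_forward_paired.fq.gz",
        f"{out_prefix}_forward_unpaired.fq.gz",
        f"{out_prefix}_reverse_paired.fq.gz",
        f"{out_prefix}_reverse_unpaired.fq.gz",
    ]
    return ["java", "-jar", jar, "PE", phred, fwd_in, rev_in, *outputs, *steps]

cmd = build_pe_command(
    "trimmomatic-0.39.jar",
    "input_forward.fq.gz",
    "input_reverse.fq.gz",
    "output",
    ["ILLUMINACLIP:TruSeq3-PE.fa:2:30:10", "LEADING:3", "TRAILING:3",
     "SLIDINGWINDOW:4:15", "MINLEN:36"],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the run, provided Java and the jar file are available.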
Some Popular Commands
Common Trimmomatic commands for preprocessing NGS data include:
Adapter Trimming
java -jar trimmomatic-0.39.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:adapter_sequences.fa:2:30:10
This directive employs the ILLUMINACLIP feature to eliminate adapter sequences. Here, adapter_sequences.fa represents the file housing the adapter sequences, while 2 denotes the seed mismatches, 30 stands for the palindrome clip threshold, and 10 signifies the simple clip threshold.
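To keep the meaning of each number straight, the colon-separated fields can be given names. This is only an illustrative sketch: the `IlluminaClip` type and `parse_illuminaclip` function are hypothetical, and the parser handles only the four-field form used above (it would not cope with additional optional fields or with file paths that contain a colon).

```python
from typing import NamedTuple

class IlluminaClip(NamedTuple):
    adapter_fasta: str
    seed_mismatches: int
    palindrome_clip_threshold: int
    simple_clip_threshold: int

def parse_illuminaclip(step: str) -> IlluminaClip:
    """Split an ILLUMINACLIP step string into named fields."""
    name, fasta, seed, palindrome, simple = step.split(":")
    if name != "ILLUMINACLIP":
        raise ValueError(f"not an ILLUMINACLIP step: {step}")
    return IlluminaClip(fasta, int(seed), int(palindrome), int(simple))

clip = parse_illuminaclip("ILLUMINACLIP:adapter_sequences.fa:2:30:10")
print(clip)
```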
Quality Trimming
java -jar trimmomatic-0.39.jar SE -phred33 input.fq.gz output_trimmed.fq.gz LEADING:20 TRAILING:20
This command trims low-quality bases from the beginning (LEADING) and end (TRAILING) of each read: bases are removed from each end for as long as their quality is below the threshold of 20.
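The effect of LEADING and TRAILING on a single read can be sketched in a few lines. This is an illustration of the idea, not Trimmomatic's own code; the function names are hypothetical, and the quality string is assumed to be Phred+33 encoded (each character's ASCII code minus 33 gives the score).

```python
def phred33_scores(quality_line):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_line]

def trim_leading_trailing(seq, qual, leading=20, trailing=20):
    """Drop bases below the thresholds from the start and end of a read."""
    scores = phred33_scores(qual)
    start = 0
    # Advance past low-quality bases at the 5' end.
    while start < len(scores) and scores[start] < leading:
        start += 1
    end = len(scores)
    # Step back over low-quality bases at the 3' end.
    while end > start and scores[end - 1] < trailing:
        end -= 1
    return seq[start:end], qual[start:end]

# '#' encodes quality 2 and 'I' encodes quality 40.
print(trim_leading_trailing("ACGTACGT", "##IIII##"))
```

A read whose bases are all below the threshold comes out empty, which is why a MINLEN step is usually placed after quality trimming.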
Sliding Window Trimming
java -jar trimmomatic-0.39.jar SE -phred33 input.fq.gz output_trimmed.fq.gz SLIDINGWINDOW:4:15
With the SLIDINGWINDOW option, Trimmomatic scans the read from the 5' end and, once the average quality within a 4-base window drops below 15, cuts the read there and discards the remainder toward the 3' end.
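The sliding window idea can be illustrated with a simplified sketch. It assumes Phred+33 quality strings and cuts at the start of the first window whose average falls below the threshold; Trimmomatic's actual implementation may choose a slightly different cut point, and the function name `sliding_window_trim` is hypothetical.

```python
def sliding_window_trim(seq, qual, window=4, min_avg=15):
    """Cut a read at the first window whose mean quality is below min_avg."""
    scores = [ord(ch) - 33 for ch in qual]  # Phred+33 decoding
    for start in range(0, len(scores) - window + 1):
        # Compare sums rather than averages to avoid floating point.
        if sum(scores[start:start + window]) < min_avg * window:
            return seq[:start], qual[:start]
    # No failing window (or read shorter than one window): keep the read as is.
    return seq, qual

# Six high-quality bases ('I' = 40) followed by four poor ones ('#' = 2).
print(sliding_window_trim("ACGTACGTAC", "IIIIII####"))
```

Because the decision is based on a window average, a single poor base surrounded by good ones does not trigger a cut, which is the main difference from the per-base LEADING and TRAILING steps.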
Minimum Length Filtering
java -jar trimmomatic-0.39.jar SE -phred33 input.fq.gz output_trimmed.fq.gz MINLEN:36
MINLEN removes reads that are shorter than the defined length after trimming, which is 36 bases in this instance.
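As a minimal sketch of the same filter, assuming reads are held as hypothetical (name, sequence) tuples that have already been trimmed:

```python
def apply_minlen(reads, min_length=36):
    """Keep only reads whose length is at least min_length."""
    return [(name, seq) for name, seq in reads if len(seq) >= min_length]

reads = [
    ("read1", "A" * 35),  # one base too short, dropped
    ("read2", "A" * 36),  # exactly at the limit, kept
    ("read3", "A" * 50),  # kept
]
kept = apply_minlen(reads)
print([name for name, _ in kept])
```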
Cropping
java -jar trimmomatic-0.39.jar SE -phred33 input.fq.gz output_trimmed.fq.gz CROP:75
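The CROP option cuts each read to a designated length, in this case 75 bases, by removing bases from the end of the read regardless of their quality. Reads that are already 75 bases or shorter are left unchanged.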
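In terms of a single read, cropping amounts to slicing the sequence and its quality string to the same length. A minimal sketch, with a hypothetical function name:

```python
def crop(seq, qual, length=75):
    """Keep at most the first `length` bases of a read and its qualities."""
    return seq[:length], qual[:length]

long_seq, long_qual = crop("A" * 100, "I" * 100)
short_seq, short_qual = crop("A" * 50, "I" * 50)
print(len(long_seq), len(short_seq))
```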
Trimmomatic effectively prepares NGS data for downstream analysis, offering a wide array of options to ensure high-quality results. Its versatility and efficiency make it a preferred tool for bioinformaticians working with Illumina sequencing data.