SNIKT icon indicating copy to clipboard operation
SNIKT copied to clipboard

Sequence-independent identification and removal of adapters/systemic contamination in shotgun sequencing data. https://doi.org/10.1093/bioinformatics/btac389

SNIKT

install with bioconda Anaconda-Server Badge Anaconda-Server Badge

Snikt (Slice Nucleotides Into Klassifiable Tequences) is a program that reports a visual confirmation of adapter or systemic contamination in whole-genome shotgun (WGS) or metagenomic sequencing DNA or RNA reads and based on user input, trims sequence ends to remove them. It works without prior information about the adapter sequence making it applilcable even when this information is unavailable. The program is written in R which allows for nice visual reports and uses Seqtk in the back-end for handling of fastq data, which is a fast, lightweight tool written in C.

This program is most suitable for long, variable-length reads, for example, from the Oxford Nanopore instruments. This is because read end trimming for long reads does not have a significant impact on the overall read throughput post-cleaning. This program has also been modified to accommodate short, fixed-length (Illumina) reads for identification of adapter contamination, however, end-trimming on such datasets is disabled by default.

1. Installation

SNIKT is available on Unix-like platforms (Linux, Mac OS and WSL). The easiest way to use SNIKT, in our opinion, is to use the Conda package management for set up. However, being a wrapper written in R, this program can also be used with a local R installation given it has access to seqtk and some R libraries. Both of these setup options are explained below.

1.1 Installation via Bioconda

For this setup process to work, you will need to install the Conda package manager and setup Bioconda repositories. Instructions for setting up Bioconda can be found on the Bioconda installation page. Only setup steps 1 and 2 are needed for this. Once you have the conda set up and working, please start a new session.

Optional: Install and use mamba for conda management. Mamba is a reimplementation of the Conda package manager in C++ and has several benefits over using Conda. It works as a drop-in replacement for conda for most Conda commands except for environment activation commands. Full help and instructions can be viewed on the Mamba GitHub page, but for simplicity here is how you can install mamba with your conda installation.

$ conda install -n base -c conda-forge mamba

After this installation, mamba can be used in the next step (and for management of any other environments/packages). We recommend using mamba as the parallelization during installation helps speed up the process of getting several R libraries if your network and storage media allows.

Now SNIKT can be set up in a new environment with the following command.

$ mamba create -n snikt snikt

or

$ conda create -n snikt snikt

And that's it for the setup with this option. Activate the environment and feel the power of Adamantium blades slicing and dicing nucleotides!

$ conda activate snikt
$ snikt.R --version
SNIKT 0.4.0

If you ever need to uninstall, you can remove the environment with the following.

$ conda remove -n snikt --all

1.2 Using a local R installation

While the conda installation is easier, we recognize that many users may want to use their local R installation instead of pulling a fresh copy of R and its libraries as a different environment. So, to setup SNIKT in this way, you will need to first get the program seqtk from the Seqtk GitHub page. A slightly older version of Seqtk is also available in the apt package manager for Ubuntu Linux.

Next, you can install the packages in your R (we recommend R >= 4.0) library. To do this, you can do the following in your R terminal.

$ R
R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
...

> install.packages( c("tidyverse","gridExtra","docopt","lubridate"), dependencies = TRUE)

Now, that the dependencies are set, we can download SNIKT.

$ git clone https://github.com/piyuranjan/SNIKT.git

After this download, you can either put the downloaded folder on your system PATH or copy the snikt.R code to a location that is already on your PATH.

2. How to use SNIKT

For first time users, we recommend using this program via the interactive method. This program can also run in batch mode if a visual interface in unavailable, for example, on a remote server with SSH.

2.1 Interactive mode

This, in our opinion, is the easiest method of using this program.

$ snikt.R reads.fastq.gz

This will prepare a temporary graph using top 10K reads from your set and will prompt you to check for contamination. While you check the figure, the prompt will wait for you to pick a trim length. This procedure also implements a default read length filter. If you would like to change that filter, you should kill the execution (Ctrl+c, Enter) and rerun with appropriate options.

Once you pick trim lengths, the program trims reads, filters them by the length specified and exports them to a fastq file. After that, it prepares a summary report with the following elements.

Pre-QC Summary

This part of the summary report is similar to the temporary graph presented to the user for making a decision on trim lengths. It shows a bias in nucleotide compositions and Phred scores suggesting contamination at ends.

Post-QC Summary

This part of the summary report shows the effect of trimming and filtering on the reads. This is the report for the post-cleanup reads that were generated as the process of running this program.

2.2 Batch mode

The batch mode is useful when the user does not have the capacity to view the temporary graph while the program runs. This is true, for example, if this program is run as a job on a supercomputing cluster. In this scenario, this program can be run in two phases, where the first phase helps with identification of contamination and the second performs cleanup of reads.

$ snikt.R --notrim reads.fastq.gz

When running in this mode, the process can also be parallelized for a large library of fastq files with command wrapping in GNU parallel. For example, the following command will run the first phase on all fastq files in a directory using <n> threads.

$ parallel -vkj<n> snikt.R -n {} ::: *.fastq.gz

This step will disable any cleanup process and will prepare a pre-QC report with a 6 panel graph. While this decision is only made with top 10K reads by default, users can use the entire dataset by specifying option --skim=0. Use this report to make a decision about trim and filter lengths.

No-QC Summary

This report can also be tailored for Illumina reads with the --illumina preset option.

Once a decision is made, user can execute SNIKT again this time with pre-set trim and filter options.

$ snikt.R --trim5=120 --trim3=25 --filter=1000 reads.fastq.gz

Or if the trim criteria are same across all fastq, the command can be parallelized on <n> threads.

$ parallel -vkj<n> snikt.R -T 120 -t 25 -f 1000 {} ::: *.fastq.gz

This step will now clean up reads and will prepare a report pre- and post-cleanup similar to the one generated with the interactive mode. Note that the previous report file will be overwritten unless changed with the option -o.

3. Compatibility

This program has been tested on the following system environments:

  • Windows Subsystem for Linux running Ubuntu 20.04 LTS
  • Ubuntu 18.04 LTS
  • CentOS 7 running on UM's ARC High Performance Compute cluster with SLURM scheduler.

This program has been tested with the following dependencies and their versions. We recommend using listed versions or higher for these dependencies.

4. Full command line help

A full version of command line help is included here for ease of use. However, this version of the help documentation is not guaranteed to be as up to date as from the program itself. Whenever in doubt, please refer to the documentation produced by running $ snikt.R -h.

SNIKT: FastQ QC and sequence over-representation check.
       A wrapper around seqtk to plot per-position nucleotide composition
       for finding and trimming adapter contamination in fastq reads.
       Also filters reads by a length threshold.
Authors Piyush Ranjan, Christopher Brown

For first-time users, interactive mode is recommended.
For detailed help and examples, please visit
https://github.com/piyuranjan/SNIKT

Location: scriptPath

Usage:
  snikt.R [options] [--] <fastq>
  snikt.R <fastq>  # Interactive # Easiest
  snikt.R [--zoom5=<nt>] [--zoom3=<nt>] <fastq>  # Interactive
  snikt.R [(--trim5=<nt> --trim3=<nt>) | --notrim] <fastq>
  snikt.R [--illumina] [-n] <fastq>

Input:
  <fastq>               Sequence file in fastQ format with exts: .fq, .fq.gz,
                          .fastq, .fastq.gz

Options:
  Presets:
  --illumina            This presets options that are better for short-read
                          Illumina datasets.
                          Sets: -f 0 -Z 50 -z 50 --hide=0
                          Defaults are configured for long-read Nanopore fastq.

  Graphing:
  --hide=<frac>         Hide the composition tail by a fraction of total bases.
                          Significantly improves speed, removes end-tail (3')
                          distortion for variable length read sets.
                          [range: 0..1] [default:0.01]
  -s, --skim=<num>      Use top num reads for pre- or no-trim graphs. This
                          improves speed. No effect on post-trim graphs.
                          Use 0 to disable skimming and utilize all reads.
                          [range: 0..maxFastqReads] [default: 10000]
  -x, --xbreaks=<num>   Suggest number of ticks/breaks on x-axis in all graphs.
                          Can be set if the breaks or gridlines are too sparse
                          or dense to determine appropriate trimming. Internal
                          algorithm adjusts ticks.
                          [range: 1..10] [default: 6]
  -Z, --zoom5=<nt>      Zoom-in from aligned 5' beginning to nt bases.
                          [range: 1..maxSeqLen] [default:300]
  -z, --zoom3=<nt>      Zoom-in from aligned 3' ending to nt bases.
                          [range: 1..maxSeqLen] [default:100]
  QC:
  -f, --filter=<nt>     Filter (drop) reads with length < nt after any trimming.
                          [range: 0..maxSeqLen] [default:500]
  -n, --notrim          Disable positional trimming; useful for short-read data
                          Takes precedence over and sets: -T 0 -t 0
  -T, --trim5=<nt>      Trim nt bases from aligned 5' side.
                          [range: 0..(maxSeqLen-trim3)] [default: interactive]
  -t, --trim3=<nt>      Trim nt bases from aligned 3' side.
                          [range: 0..(maxSeqLen-trim5)] [default: interactive]
  IO:
  -o, --out=<prefix>    Prefix for output files [default: fastqNoExtension]
  -w, --workdir=<path>  Path to generate QC file and report. [default: ./]
  --gzip                If fastq file is gzipped. Autodetected normally
                          using the file extension. For large datasets, prior
                          decompression of fastq may be faster. Only gzip is
                          supported within; prior decompression needed for any
                          other method.
  -k, --keep            Keep intermediate (temporary) directory.

  Generic:
  -h, --help            Show help and exit 0.
  -v, --verbose         Enable status messages.
  --debug               Debug with traceback; enables -v.
  --version             Show version and exit 0.

5. FAQ

5.1 Why the name SNIKT?

We at the University of Michigan are a huge Wolverine fan. So we couldn't pass an opportunity to relate our program that involves identifying (sniffing) and trimming (slashing) nucleotides with our favorite hero. So despite our humble attempt at making a backronym, we have named our program by the sound of Wolverine's claws being drawn - a metallic noise always written as "SNIKT!".

SNIKT - Slice Nucleotides Into Klassifiable Tequences

Imagine the following:

Hey adapter, why don't you just back off...

                                                - Wolverine

And after a SNIKT! your read sequences are no longer sequences... they are tequences!

5.2 Can this method be used for marker gene (16S) read data?

While this method can technically be used over marker gene sequencing data like the 16S, it is not always easy to pick out sequence over-representation patterns. Still, the adapter sequences are so alike that in our tests we could see a difference in real 16S sequence representation and adapters. That said, the problem will be harder when the community would be dominated by fewer taxonomies with higher similarity in their V3-V4 region of the 16S generally sequenced for such studies. On the other side, most 16S analysis platforms like mothur or QIIME2 have their own protocols for removing read contamination which works well. Adapters also have less of an impact over marker gene sequencing because at some point in the analysis, users align it to a database of marker genes of interest. Analysis protocols can choose to not look at any segments of the read that did not align and completely circumvent any adapter issues. This is not feasible for metagenomic sequencing as it is randomized over entire lengths of genomes in a community.

6. Citation

Ranjan, P., Brown, C. A., Erb-Downward, J. R., & Dickson, R. P. (2022). SNIKT: sequence-independent adapter identification and removal in long-read shotgun sequencing data. Bioinformatics, 38(15), 3830-3832. https://doi.org/10.1093/bioinformatics/btac389