viralrecon icon indicating copy to clipboard operation
viralrecon copied to clipboard

Add option to hard trim amplicon primer sequences

Open drpatelh opened this issue 2 years ago • 4 comments

Description of feature

Context in this Tweet. Used by @peterk87 @fmaguire in this pre-print.

Be great if you can add a little description as to why hard-clipping was preferred over soft-clipping and anything else that you found on your travels that would be useful to know for the implementation including the command you used :)

Methods

Paired-end illumina reads from ARTIC V4 and Fusion Genomics sequencing were initially analysed separately
with the nf-core/viralrecon Nextflow workflow (v2.3) (Di Tommaso et al., 2017; Ewels et al., 2020; Patel et
al., 2022) which ran: FASTQC (v0.11.9) (Andrews, 2010) read-level quality control, fastp (v0.20.1) (Chen et al.,
2018) quality filtering and adapter trimming, Bowtie2 (v2.4.2) (Langmead & Salzberg, 2012) read mapping to
Wuhan-Hu-1 (MN908947.3) (Wu et al., 2020) SARS-CoV-2 reference, Mosdepth (v0.3.1)(Pedersen & Quinlan,
2018)/Samtools (v.1.12) (Li et al., 2009) read mapping statistics calculation, iVar (v1.3.1) (Grubaugh et al.,
2019) ARTIC V4 primer trimming, variant calling, and consensus generation; SnpEff (v5.0) (Cingolani, Platts,
et al., 2012)/SnpSift (v4.3t) (Cingolani, Patel, et al., 2012) for variant effect prediction and annotation; and
Pangolin (v3.1.20) (O’Toole et al., 2021) with PangoLEARN (2022-01-05), Scorpio (v0.3.16) (Colquhoun &
Jackson, 2021), and Constellations (v.0.1.1) was used for PANGO lineage (Rambaut et al., 2020) assignment.
iVar primer trimmed soft-clipped read alignments were converted to hard-clipped alignments with fgbio
ClipBam (http://fulcrumgenomics.github.io/fgbio/). Reads from hard-clipped BAM files were output to
FASTQ files with “samtools fastq”. nf-core/viralrecon was re-run in amplicon mode without iVar primer
trimming on the combined Fusion Genomics and ARTIC V4 primer trimmed FASTQ reads to generate the
variant calling, variant effect and consensus sequence results used in downstream analyses.

drpatelh avatar Mar 01 '22 14:03 drpatelh

Hi @drpatelh,

Several of the samples were difficult to sequence with very high Ct values, so we ended up trying a few different methods (ARTIC V4, 1200bp, FG) on Illumina as well as the 1200bp Freed et al (2020) method on GridION. Initial analysis suggested that the ARTIC V4 amplicon and Fusion Genomics (FG) probe-capture data were complementary with the FG data filling in a lot of gaps in coverage of the ARTIC and vice versa. So naturally, we wanted to merge the reads together while only trimming primers from the ARTIC amplicon reads.

Originally, I wanted to merge the BAM files from separate viralrecon analyses on the ARTIC V4 reads with primer trimming and FG probe-capture reads without primer trimming, but unfortunately, I wouldn't have been able to use the merged BAM files as input for viralrecon, and extracting the BAM file reads to FASTQ with samtools fastq did not remove soft-clipped regions in reads. So hard-clipping was required. Fortunately, like Nils Homer mentioned in his tweet, fgbio had the very convenient ClipBam command to "upgrade" clipping from soft to hard, so we could ensure that the primer trimmed regions would be trimmed from the FASTQ reads.

The Bash script below is basically what was done to each ARTIC V4 iVar trimmed BAM file from viralrecon:

fgbio ClipBam -i 4662.ivar_trim.sorted.bam -o 4662.ivar_trim.sorted.fgbio-hardclip.bam -H -r MN908947.3.fa
samtools fastq -1 4662_1.fq -2 4662_2.fq -0 /dev/null -s 4662.single.fq 4662.sorted.bam
cat 4662_1.fq 4662_2.fq 4662.single.fq | gzip -c > 4662.fq.gz

The fq.gz files were provided as input for viralrecon and viralrecon handled merging reads for the same samples.

Out of curiosity, I tried running viralrecon on the FG reads with ARTIC primer trimming and found that 10% of the FG reads would have primers trimmed. So I couldn't just naively run all the sequence data through viralrecon with primer trimming because a significant amount of data would be trimmed away.

We could have used CutAdapt to trim the primer sequences from the reads, but we haven't done enough testing with CutAdapt primer trimming and iVar has worked well for our primer trimming needs.

Let me know if you have any other questions!

Thanks for putting together such a great pipeline!

Peter

peterk87 avatar Mar 01 '22 19:03 peterk87

Hi @drpatelh,

Thanks a lot for viralrecon !

I recently found out I could have the use of such option to hard-clip primers and my case is a bit similar to @peterk87's except :

  • I will have to submit our raw sequencing data to a national plateform and they ask for uploaded FASTQ to be de-hosted + without primers (to avoid dealing with all different primers sets used by all different submiters)
  • I also started from .ivar_trim.sorted.bam files produces by viralrecon and tried both samtools fastq and bamtools bamtofastq, but as mentionned they include soft-clipped portions, without any option to change this behaviour (even if this was asked a while ago)

The little I investigated, ivar trim does not have an option to hard-clip primers instead => So IMHO it makes implementation of this option a bit more complex, because :

  • You would have either to add a step for viralrecon to convert soft-clipped portions to hard-clipped ones (to do so I personally went for ngsutilsj and its subcommand bam-removeclipping, even if marked as experimental in corresponding --help)
  • Or replace ivar trim step by samtools amplicon, but as I saw it should produce slighty different results and will require requalifying downstream steps like variant calling and so

Hope this helps ! Kind regards, Felix.

tetedange13 avatar Aug 19 '22 08:08 tetedange13

Hi @tetedange13 ! Thank you for using the pipeline and for your comments!

The Bash script below is basically what was done to each ARTIC V4 iVar trimmed BAM file from viralrecon:

fgbio ClipBam -i 4662.ivar_trim.sorted.bam -o 4662.ivar_trim.sorted.fgbio-hardclip.bam -H -r MN908947.3.fa samtools fastq -1 4662_1.fq -2 4662_2.fq -0 /dev/null -s 4662.single.fq 4662.sorted.bam cat 4662_1.fq 4662_2.fq 4662.single.fq | gzip -c > 4662.fq.gz

The implementation suggested by Peter above would suit your needs right? It removes soft-clipped regions from the iVar trimmed BAM files and then converts back to FastQ.

drpatelh avatar Aug 19 '22 09:08 drpatelh

Hi @drpatelh,

Yup, for what I saw produced FASTQ are identical with both fgbio ClipBam and ngsutilsj

Just to precise Peter's implementation, I could not manage to make fgbio v2.0.1 work with samtools sort -n => So here is the oneliner I used :

fgbio SortBam --input=sample.ivar_trim.sorted.bam --sort-order=QueryName --ouput=/dev/stdout |
    fgbio ClipBam --input=/dev/stdin --upgrade-clipping --output=/dev/stdout --ref=/viralrecon/data/GCA_009858895.3_ASM985889v3_genomic.200409.fna |
    samtools sort -n |
    samtools fastq -c 9 -0 /dev/null -s /dev/null -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz

Regarding performances, ngsutilsj seems a bit faster (I guess mostly because it does not require to have a query-sorted input BAM, contrary to fgbio ClipBam)

Thanks again ! Have a nice day, Felix.

tetedange13 avatar Aug 31 '22 12:08 tetedange13