hostile icon indicating copy to clipboard operation
hostile copied to clipboard

Allow adding of Illumina Casava 1.8 format entry to fastq headers

Open charlesfoster opened this issue 1 year ago • 2 comments

Hi,

Thanks for the useful tool. Would it be possible to add the option to modify the headers of the output fastq files? I can see in aligner.py that the (final) command for generating clean reads with paired-end data is:

f" | samtools fastq --threads 4 -c 6 -N -1 '{fastq1_out_path}' -2 '{fastq2_out_path}'"

This results in read headers in the form of "@<read_id>/1". It would be useful to have the option to output clean reads with the Illumina Casava 1.8 format entry, which is an option in samtools fastq:

  -i           add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
  <truncated>
  --barcode-tag TAG
               Barcode tag [BC]
  --quality-tag TAG
               Quality tag [QT]
  --index-format STR
               How to parse barcode and quality tags

      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Minimally, the command could be:

f" | samtools fastq --threads 4 -c 6 -n -i --index-format "i*"  -1 '{fastq1_out_path}' -2 '{fastq2_out_path}'"

This results in read headers in the form of "@<read_id> 1:N:0:0".

In my case this is useful because my previous method of host decontamination (bowtie2 ... --un-conc-gz id.unmapped.fastq.gz ...) resulted in read headers in that format, and hence downstream read manipulation is based on that format. However, I can understand if this might not be a priority to implement. If not, could there be an option to save a *.bam file with clean reads, allowing users to also extract reads to file as they see fit?

Thanks

charlesfoster avatar Sep 04 '24 03:09 charlesfoster

Hi Charles, Thanks for raising this and helpfully suggesting an implementation. I will review this alongside some other planned improvements, which will be in the order of weeks. If you'd like to get this working in the meantime, I can offer guidance for improvising an interim solution in case you haven't already done so.

Thanks, Bede

bede avatar Sep 09 '24 19:09 bede

Great, thanks. I'll follow along :)

In the meantime I've just got the following hacky code in my Nextflow script which does the job:

## modify the file headers if need be
if [[ "\$artificial_cassava" == "true" ]]; then
    for file in *.clean_*.fastq.gz; do
        name=\$(basename "\$file")
        zcat "\$file" | awk '
        {
            if (NR % 4 == 1) {
                if (\$0 ~ /\\/1\$/) {
                    sub(/\\/1\$/, " 1:N:0:0")
                } else if (\$0 ~ /\\/2\$/) {
                    sub(/\\/2\$/, " 2:N:0:0")
                }
            }
            print
        }
        ' | gzip > tmp.fastq.gz
        mv tmp.fastq.gz "\$name"
    done
fi

charlesfoster avatar Sep 10 '24 00:09 charlesfoster

I'll add this in the next release, hopefully this week

bede avatar Dec 16 '24 18:12 bede

Released in 2.0.0

bede avatar Dec 19 '24 17:12 bede