Add ability to generate custom tags e.g. RX tag for UMIs by parsing read names

Open eboyden opened this issue 6 years ago • 1 comments

I have looked extensively for a tool that allows parsing a sam read name to extract information and place it in a (customizable) tag, especially for UMI sequences present in read names. E.g. bcl2fastq can extract UMI sequences and put them at the end of fastq read names (with a "+" delimiter in the case of dual UMIs), but Picard and many other downstream tools require UMIs to be in sam tags, without providing a convenient means of moving the umi from the read name to a tag (umi-tools can sort of do this, but it only generates tags for read 1 of paired reads). Instead Picard requires generating unmapped bams (which most aligners cannot handle directly) from bcls, converting those ubams to fastqs for alignment, and then merging the mapped and unmapped bams to replace the lost metadata. This is a complicated solution to a simple problem, and a non-starter for folks using pipelines that require bcl2fastq and aren't able to convert their fastqs to ubams.

Hence, it would be great if samtools had a tool to do this; ideally it would be a generic and customizable read name parser/tag generator, but even something specific for moving/copying bcl2fastq-extracted umi sequences from the read name to an RX tag (and optionally removing delimiters e.g. "+" if present, or changing them to "_" which is Picard-compatible) would be a much appreciated feature.

Nov 26 '19 03:11 eboyden

It's not something we've got support for currently in Samtools, and given there are a variety of other tools out there it's not high on the priority list just now.

Realistically, if we documented a general purpose read name tokeniser coupled with a rule engine that permits tokens to be migrated to form tags and to describe how to rearrange tokens to create new read names, then you're creating a new mini programming language. Maybe using one of the existing ones is a better solution.

Possible alternative solutions could be:

Use Bambi (https://github.com/wtsi-npg/bambi/) as an alternative to bcl2fastq. Bambi goes directly from bcl to sam (or bam / cram) including parsing of barcode tags, so it ought to get you to your destination in a single step instead of needing intermediate file formats and subsequent processing steps.
For SAM / BAM / CRAM files that you already have, you could process them as SAM format using a quick (m)awk, perl, etc one-liner to regexp match the read name and convert it to a new name + RX tag. (eg something like samtools view -h ~/scratch/data/novaseq.10m.bam | mawk '/^@/ {print;next} {N=split($1,n,":");print $0 "\tBX:Z:" n[N]}' - mawk is comparable speed to the samtools view command here, so not really a bottleneck)
Similarly you could use samjdk (http://lindenb.github.io/jvarkit/SamJdk.html) to rewrite the data on-the-fly with a piece of embedded Java code. One for the programmers out there, but it looks quite powerful. See for example https://www.biostars.org/p/408279/ which shows a trivial way of renaming barcode tags. (I don't know how performant this is vs a fast mawk script though; that'd be my go to language personally.)

Edit: for curiosity, the perl equivalent to the awk 1-liner above would be the more verbose perl -lane 'if (/^@/) {print;next} @n=split(":",$F[0]);chomp($_);print $_."\tBX:Z:".$n[-1];'. Speed wise on 1 million records perl took 6.5s, gawk took 4.4s and mawk took 1.8s.

Long live awk! ;-)

Nov 26 '19 16:11 jkbonfield