
Potential duplicate generation of RG tag on inputs with RG information

Open SHuang-Broad opened this issue 10 months ago • 0 comments

Hi Heng,

This isn't necessarily a bug, but I was a bit surprised. Also, this definitely is not a high-impact issue.

So, this is arguably an edge case: when every read in the input already carries its RG (read group) tag, that tag can end up duplicated in the output.

Here's an example.

Say the input is an unaligned BAM that has the RG tag on all its reads (along with other tags, like 5mC calls). One would run a command like the following:

samtools fastq -t -T MM,ML <input_ubam> \
| minimap2 -ayYL -x <preset> -R "@RG\tID:matching_readgroup_id..." <ref> - \
| samtools sort -o output.bam

This will create two RG tags for each read. Of course, this can be avoided by dropping the -t flag from samtools fastq. But the samtools fastq documentation says -t copies not only RG but also the BC and QT tags, so one could still want to keep that flag. Alternatively, one can skip specifying the read group info for minimap2 and add it later with samtools reheader, but this is extra work.
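A third workaround, sketched here under assumptions, is to keep -t but drop only the RG field from the FASTQ header lines before they reach minimap2. The helper name strip_rg_comment is hypothetical, and the sketch assumes that samtools fastq writes 4-line records with the copied tags tab-separated on the header line.

```shell
# Hypothetical helper (not part of samtools or minimap2): remove any RG:Z:
# field from FASTQ header lines, leaving other tab-separated tags
# (e.g. BC, QT, MM, ML) untouched.
# Assumption: strict 4-line FASTQ records, so headers are lines 1, 5, 9, ...
strip_rg_comment() {
  awk 'NR % 4 == 1 { gsub(/\tRG:Z:[^\t]*/, "") } { print }'
}

# Intended use: insert it between the two tools, e.g.
#   samtools fastq -t -T MM,ML <input_ubam> | strip_rg_comment | minimap2 ...
```

With the RG field gone from the comments, the only RG tag on each read should be the one derived from -R, while the other copied tags still pass through -y.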

So, a convenient feature would be for minimap2 to check whether the "comments" that would be copied from the input FASTQ already contain RG, and if so, not write it again based on the information provided via -R "@RG\tID:matching_readgroup_id...".
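Until something like that exists, the effect of the proposed check can be approximated downstream of minimap2. The following is only a rough sketch: dedup_rg_tag is a hypothetical name, and it assumes the duplicated tags carry the same read group ID, so keeping the first RG:Z: field on each alignment line loses nothing.

```shell
# Hypothetical post-filter (not a minimap2 feature): keep only the first
# RG:Z: optional field on each SAM alignment line.
# Assumption: both RG tags hold the same ID, so dropping later ones is safe.
dedup_rg_tag() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@/ { print; next }            # pass header lines through unchanged
       {
         line = $1; seen = 0
         for (i = 2; i <= NF; i++) {
           # optional fields start at column 12 in SAM
           if (i > 11 && $i ~ /^RG:Z:/) {
             if (seen) continue        # skip any repeated RG tag
             seen = 1
           }
           line = line OFS $i
         }
         print line
       }'
}

# Intended use: place it before sorting, e.g.
#   ... | minimap2 ... | dedup_rg_tag | samtools sort -o output.bam
```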

Thanks, Steve

SHuang-Broad, Aug 30 '23 03:08