CopyUmiFromReadName should support reverse-complemented UMIs prefixed by "r"

Open msto opened this issue 1 year ago • 0 comments

Problem

Some UMIs produced by BCL Convert are prefixed with "r", indicating "reverse complement."^1

When sequencing a run with Unique Molecular Identifier (UMI) situated in the index2 (i5) on a NextSeq1000/2000 instrument, BCL Convert will put a leading "r" in front of the reverse-complemented UMI in the FASTQ header.

CopyUmiFromReadName enforces that the UMI sequence contains only valid bases (A/C/G/T/N^2) or a delimiter between multiple UMIs (+ or -). UMIs prefixed with "r" fail this validation.
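To make the failure concrete, here is a minimal Python sketch of the kind of check described above. It is not fgbio's actual code (fgbio is written in Scala); the pattern and the `is_valid_umi` helper are assumptions made purely for illustration.

```python
import re

# Hypothetical stand-in for the validation described above: only the bases
# A/C/G/T/N and the "+" / "-" delimiters are accepted.
VALID_UMI = re.compile(r"[ACGTN+-]+")


def is_valid_umi(umi: str) -> bool:
    """Return True if every character is an allowed base or delimiter."""
    return VALID_UMI.fullmatch(umi) is not None


print(is_valid_umi("ACGT+TTGA"))   # plain dual UMI
print(is_valid_umi("ACGT+rTTGA"))  # second UMI carries the "r" prefix
```

Under these assumptions the first call passes and the second is rejected, because "r" is neither a base nor a delimiter.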

Proposed solution

I think it would be sensible to add the following features to CopyUmiFromReadName:

  • --umi-delimiter (Char, default=+)

    • The default should be +, as this is the default delimiter in Illumina FASTQs.^2
    • If this character appears in the UMI sequence, split the sequence into multiple UMIs and validate each separately.
    • Join multiple UMIs with a hyphen (-) before storing them in the RX tag, per the SAM spec.^3
  • Support reverse-complemented UMIs.

    • For each UMI, if it begins with "r", remove the "r" and (optionally?) reverse-complement the remaining sequence.
    • (NB: This could be turned off by default, e.g. with --allow-reverse-umis as a flag. @clintval raised the concern that degenerate UMIs could include r as a masked A or G^4, although this does not appear to be permitted under the current Illumina FASTQ spec.^2)
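The proposed behavior could be sketched roughly as follows. This is an illustrative Python sketch, not fgbio code: `normalize_umi` and its parameters are hypothetical names that mirror the suggested `--umi-delimiter` and `--allow-reverse-umis` options, and the `reverse_complement_umis` switch is an assumption standing in for the open "(optionally?)" question.

```python
# Complement table for the bases permitted in a UMI; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
_VALID_BASES = set("ACGTN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a sequence of A/C/G/T/N bases."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_umi(
    raw: str,
    delimiter: str = "+",                  # mirrors the proposed --umi-delimiter
    allow_reverse_umis: bool = False,      # mirrors the proposed --allow-reverse-umis
    reverse_complement_umis: bool = True,  # assumption: whether to undo the "r"
) -> str:
    """Split on the delimiter, validate each UMI, and join with "-" for the RX tag."""
    umis = []
    for umi in raw.split(delimiter):
        if umi.startswith("r"):
            if not allow_reverse_umis:
                raise ValueError(f"UMI has an 'r' prefix but reverse UMIs are disabled: {umi}")
            umi = umi[1:]  # drop the "r" marker
            if reverse_complement_umis:
                umi = reverse_complement(umi)
        # Validate each UMI separately, after any prefix handling.
        if not umi or not set(umi) <= _VALID_BASES:
            raise ValueError(f"Invalid UMI: {umi!r}")
        umis.append(umi)
    return "-".join(umis)


print(normalize_umi("ACGT+rTTGA", allow_reverse_umis=True))  # ACGT-TCAA
```

Keeping the prefix handling off by default means an unexpected "r" still fails loudly, which addresses the concern about degenerate-base codes without changing existing behavior.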

> Restricted characters: A/T/G/C/N

> UMI sequences for Read 1 and Read 2, separated by a plus [+].

> In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes.

msto, Jan 19 '24 16:01