fgbio
CopyUmiFromReadName should support reverse-complemented UMIs prefixed by "r"
Problem
Some UMIs produced by BCL Convert are prefixed with "r", indicating "reverse complement."^1
When sequencing a run with Unique Molecular Identifier (UMI) situated in the index2 (i5) on a NextSeq1000/2000 instrument, BCL Convert will put a leading "r" in front of the reverse-complemented UMI in the FASTQ header.
`CopyUmiFromReadName` enforces that the UMI sequence contains only valid bases (A/C/G/T/N^2) or a delimiter between multiple UMIs (`+` or `-`). UMIs prefixed with "r" fail this validation.
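A minimal Python sketch of this kind of check makes the failure concrete. It is not fgbio's actual Scala implementation: the function name `is_valid_umi`, the regular expression, and the uppercase-only assumption are all hypothetical and used only for illustration.

```python
import re

# Hypothetical pattern (an assumption, not fgbio's code): one or more
# segments of A/C/G/T/N, joined by a single '+' or '-' delimiter.
_UMI_PATTERN = re.compile(r"[ACGTN]+(?:[+-][ACGTN]+)*")


def is_valid_umi(umi: str) -> bool:
    """Return True if `umi` has only A/C/G/T/N segments separated by '+' or '-'."""
    return _UMI_PATTERN.fullmatch(umi) is not None


print(is_valid_umi("ACGT+TTGA"))   # True
print(is_valid_umi("ACGT+rTTGA"))  # False: the leading "r" is not a valid base
```

Under a check like this, any UMI segment that BCL Convert has prefixed with "r" is rejected, even though the bases after the prefix are valid.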
Proposed solution
I think it would be sensible to add the following features to `CopyUmiFromReadName`:

- `--umi-delimiter` (`Char`, default=`+`)
- Support reverse complemented UMIs.
  - For each UMI, if it begins with "r", remove the "r" and (optionally?) reverse-complement the remaining sequence
  - (NB: This could be turned off by default, e.g. with `--allow-reverse-umis` as a flag. @clintval raised the concern that degenerate UMIs could include `r` as a masked A or G^4, although this does not appear to be permitted under the current Illumina FASTQ spec.^2)
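The proposed behaviour could be sketched roughly as follows, in Python rather than fgbio's Scala. Everything here is an assumption for illustration: the names `normalize_umi` and `reverse_complement`, the parameter names, the choice to treat only a lowercase "r" as the prefix, and the choice to rejoin the segments with a hyphen (following the hyphen recommendation quoted in this issue).

```python
# Complement table for the bases allowed in a UMI (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T/N sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_umi(
    umi: str,
    in_delimiter: str = "+",      # hypothetical stand-in for --umi-delimiter
    out_delimiter: str = "-",     # assumption: hyphen between UMIs on output
    revcomp_prefixed: bool = True,  # the "(optionally?)" part of the proposal
) -> str:
    """Strip a leading lowercase 'r' from each UMI segment.

    When `revcomp_prefixed` is True, the bases that followed the 'r'
    are also reverse-complemented; otherwise only the prefix is removed.
    """
    normalized = []
    for segment in umi.split(in_delimiter):
        if segment.startswith("r"):
            bases = segment[1:]
            normalized.append(reverse_complement(bases) if revcomp_prefixed else bases)
        else:
            normalized.append(segment)
    return out_delimiter.join(normalized)


print(normalize_umi("ACGT+rTTGA"))                          # ACGT-TCAA
print(normalize_umi("ACGT+rTTGA", revcomp_prefixed=False))  # ACGT-TTGA
```

Matching only a lowercase "r" is one possible way to sidestep the degenerate-base concern, since an IUPAC code for A or G would conventionally be written as an uppercase `R`; whether that distinction is safe to rely on is exactly the open question raised above.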
> Restricted characters: A/T/G/C/N
> UMI sequences for Read 1 and Read 2, separated by a plus [+].
> In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes.