fgpyo
fgpyo copied to clipboard
Feature: SAM tag enum
I think it would be helpful to provide an enum that describes the standard SAM tags.
I am happy to implement this but would appreciate feedback on design considerations before starting.
I have two primary questions.
-
Should the member names be the actual SAM tags, or more semantically meaningful?
e.g.
class SamTag(str, Enum): """Standard SAM tags.""" RG: "RG" """Read group.""" RX: "RX" """Sequence bases of the (possibly corrected) unique molecular identifier."""or
class SamTag(str, Enum): """Standard SAM tags.""" READ_GROUP: "RG" """Read group.""" UMI: "RX" """Sequence bases of the (possibly corrected) unique molecular identifier."""(note that I suggest mixing in
strso the enums can be passed directly to pysam's tagging functions, e.g.read.has_tag(SamTag.UMI)) -
To permit extensions to custom tags, I think providing a class decorator would be more helpful than subclassing. The decorator would
- inject the standard tags
- enforce uniqueness using
enum.unique - enforce that custom tags are two-character strings, and
- optionally enforce that custom tags adhere to SAM convention, namely that tags start with "X", "Y", or "Z", or are lowercase
(i) and (ii) would be achievable with subclassing, but I think a decorator is necessary to provide the validations on custom tags. Thoughts?
e.g.
@sam_tag(strict=False) class ClientTag(str, Enum): """Custom SAM tags used for $client.""" FOO: "XF" """Foo.""" BAR: "XB" """Bar."""
Does this belong in pysam, or in htslib and exposed in pysam? Have we asked them if they would accept a contribution? And if not, why?
I can open a similar issue on pysam.
I don't think I have a clear mental model on distinguishing the features we consider for inclusion in fgpyo vs attempting to submit upstream, so I started here
https://github.com/pysam-developers/pysam/issues/1272
@nh13 Given the lack of response from pysam, would this be re-considered for inclusion in fgpyo?
I have an initial implementation at https://github.com/msto/sam_tags/ that I would love to merge in here
@nh13 bump 🙂
@nh13 bump
I am ok with this. @tfenne ?
I'm game for the first half - i.e. an enum that encodes standard values. I generally prefer to use the tags for the name (so e.g. NM, SA) rather than a longer name because the former is explicit and well-known and I think this is mostly to prevent typos?
I'm not sure I see the value (yet) in the decorator for making other enums of sam tags. I.e. I tend to think we should use str anywhere we require a tag, and then you can supply one of these enum values, a str directly or another similar enum.