fgpyo icon indicating copy to clipboard operation
fgpyo copied to clipboard

Feature: SAM tag enum

Open msto opened this issue 1 year ago • 8 comments

I think it would be helpful to provide an enum that describes the standard SAM tags.

I am happy to implement this but would appreciate feedback on design considerations before starting.

I have two primary questions.

  1. Should the member names be the actual SAM tags, or more semantically meaningful?

    e.g.

    class SamTag(str, Enum):
        """Standard SAM tags."""
    
        RG: "RG"
        """Read group."""
    
        RX: "RX"
        """Sequence bases of the (possibly corrected) unique molecular identifier."""
    

    or

    class SamTag(str, Enum):
        """Standard SAM tags."""
    
        READ_GROUP: "RG"
        """Read group."""
    
        UMI: "RX"
        """Sequence bases of the (possibly corrected) unique molecular identifier."""
    

    (note that I suggest mixing in str so the enums can be passed directly to pysam's tagging functions, e.g. read.has_tag(SamTag.UMI))

  2. To permit extensions to custom tags, I think providing a class decorator would be more helpful than subclassing. The decorator would

    1. inject the standard tags
    2. enforce uniqueness using enum.unique
    3. enforce that custom tags are two-character strings, and
    4. optionally enforce that custom tags adhere to SAM convention, namely that tags start with "X", "Y", or "Z", or are lowercase

    (i) and (ii) would be achievable with subclassing, but I think a decorator is necessary to provide the validations on custom tags. Thoughts?

    e.g.

    @sam_tag(strict=False)
    class ClientTag(str, Enum):
        """Custom SAM tags used for $client."""
    
        FOO: "XF"
        """Foo."""
    
        BAR: "XB"
        """Bar."""
    

msto avatar Mar 12 '24 13:03 msto

Does this belong in pysam, or in htslib and exposed in pysam? Have we asked them if they would accept a contribution? And if not, why?

nh13 avatar Mar 15 '24 18:03 nh13

I can open a similar issue on pysam.

I don't think I have a clear mental model on distinguishing the features we consider for inclusion in fgpyo vs attempting to submit upstream, so I started here

msto avatar Mar 15 '24 18:03 msto

https://github.com/pysam-developers/pysam/issues/1272

msto avatar Mar 15 '24 18:03 msto

@nh13 Given the lack of response from pysam, would this be re-considered for inclusion in fgpyo?

I have an initial implementation at https://github.com/msto/sam_tags/ that I would love to merge in here

msto avatar Apr 17 '24 06:04 msto

@nh13 bump 🙂

msto avatar May 27 '24 17:05 msto

@nh13 bump

msto avatar Jul 12 '24 14:07 msto

I am ok with this. @tfenne ?

nh13 avatar Jul 13 '24 19:07 nh13

I'm game for the first half - i.e. an enum that encodes standard values. I generally prefer to use the tags for the name (so e.g. NM, SA) rather than a longer name because the former is explicit and well-known and I think this is mostly to prevent typos?

I'm not sure I see the value (yet) in the decorator for making other enums of sam tags. I.e. I tend to think we should use str anywhere we require a tag, and then you can supply one of these enum values, a str directly or another similar enum.

tfenne avatar Jul 13 '24 22:07 tfenne