Consider more advanced views into the alignments in Template

Open clintval opened this issue 11 months ago • 4 comments

After fixing an issue in Template where records were added twice to different internal fields in this PR:

  • https://github.com/fulcrumgenomics/fgpyo/pull/203

We might want to consider more specific "views" into the different categories of alignments for multi-mapping chimeric use cases.

Some ideas for useful ways to view the alignments for either the R1 or R2 ordinal:

  • Only primary (not secondary, not supplementary)
  • Only not primary (all secondary and supplementary)
  • Only secondary including "secondary supplementals"
  • Only secondary not including "secondary supplementals"
  • Only supplementary including "secondary supplementals"
  • Only supplementary not including "secondary supplementals"
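These categories come down to the combination of two SAM flag bits, secondary (0x100) and supplementary (0x800). A minimal sketch of that classification follows; `Rec` and `category` are hypothetical names used for illustration only and are not part of fgpyo or pysam.

```python
from dataclasses import dataclass

# Flag bits as defined in the SAM specification.
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


@dataclass(frozen=True)
class Rec:
    """Hypothetical stand-in for an alignment record; only the flag matters here."""

    name: str
    flag: int

    @property
    def is_secondary(self) -> bool:
        return bool(self.flag & SECONDARY)

    @property
    def is_supplementary(self) -> bool:
        return bool(self.flag & SUPPLEMENTARY)


def category(rec: Rec) -> str:
    """Classify a record by which of the two flag bits are set."""
    if rec.is_secondary and rec.is_supplementary:
        return "secondary supplementary"
    if rec.is_secondary:
        return "secondary"
    if rec.is_supplementary:
        return "supplementary"
    return "primary"


records = [
    Rec("a", 0),
    Rec("b", SECONDARY),
    Rec("c", SUPPLEMENTARY),
    Rec("d", SECONDARY | SUPPLEMENTARY),
]
categories = {rec.name: category(rec) for rec in records}
# "Only not primary" is then everything with at least one of the two bits set.
not_primary = [rec.name for rec in records if category(rec) != "primary"]
```

Each of the views listed above is a filter over one or more of these four combinations.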

clintval avatar Dec 30 '24 17:12 clintval

One mockup for a new class:

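from collections.abc import Iterable, Iterator
from dataclasses import dataclass, field

from pysam import AlignedSegment

@dataclass(frozen=True)
class Template(Iterable[AlignedSegment]):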
    r1: AlignedSegment | None = None
    r2: AlignedSegment | None = None
    r1_auxiliaries: list[AlignedSegment] = field(default_factory=list)
    r2_auxiliaries: list[AlignedSegment] = field(default_factory=list)

    def r1_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r1() if rec.is_supplementary and not rec.is_secondary)
    
    def r2_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r2() if rec.is_supplementary and not rec.is_secondary)
    
    def r1_secondaries(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r1() if rec.is_secondary and not rec.is_supplementary)
    
    def r2_secondaries(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r2() if rec.is_secondary and not rec.is_supplementary)

    def r1_secondary_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r1() if rec.is_secondary and rec.is_supplementary)
    
    def r2_secondary_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r2() if rec.is_secondary and rec.is_supplementary)

    def r1_and_supplementals(self) -> Iterator[AlignedSegment]:
        yield from ([] if self.r1 is None else [self.r1])
        yield from self.r1_supplementals()

    def r2_and_supplementals(self) -> Iterator[AlignedSegment]:
        yield from ([] if self.r2 is None else [self.r2])
        yield from self.r2_supplementals()

    def r1_secondary_and_supplementals(self) -> Iterator[AlignedSegment]:
        yield from self.r1_secondaries()
        yield from self.r1_secondary_supplementals()
        
    def r2_secondary_and_supplementals(self) -> Iterator[AlignedSegment]:
        yield from self.r2_secondaries()
        yield from self.r2_secondary_supplementals()

    def all_r1(self) -> Iterator[AlignedSegment]:
        yield from ([] if self.r1 is None else [self.r1])
        yield from self.r1_auxiliaries

    def all_r2(self) -> Iterator[AlignedSegment]:
        yield from ([] if self.r2 is None else [self.r2])
        yield from self.r2_auxiliaries

    def __iter__(self) -> Iterator[AlignedSegment]:
        yield from self.all_r1()
        yield from self.all_r2()

If the namespace is too cluttered, we could shim a TemplateView class under a Template.view cached property that holds a reference back to the template and provides different iterators over the dataclass data. For example:

template.r1
template.r2
template.r1_auxiliaries
template.r2_auxiliaries

template.view.all_r1()
template.view.all_r2()
template.view.r1_secondaries()
template.view.r2_secondaries()
...
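One way the shim could be wired up is sketched below, assuming `functools.cached_property` on a frozen dataclass without slots. `Rec` is a hypothetical stand-in for `AlignedSegment`, and only the R1 side and two of the iterators are shown.

```python
from dataclasses import dataclass, field
from functools import cached_property
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class Rec:
    """Hypothetical stand-in for pysam's AlignedSegment."""

    name: str
    is_secondary: bool = False
    is_supplementary: bool = False


class TemplateView:
    """Hypothetical view that keeps a reference back to its template."""

    def __init__(self, template: "Template") -> None:
        self._template = template

    def all_r1(self) -> Iterator[Rec]:
        if self._template.r1 is not None:
            yield self._template.r1
        yield from self._template.r1_auxiliaries

    def r1_secondaries(self) -> Iterator[Rec]:
        return (
            rec
            for rec in self.all_r1()
            if rec.is_secondary and not rec.is_supplementary
        )


@dataclass(frozen=True)
class Template:
    r1: Optional[Rec] = None
    r1_auxiliaries: List[Rec] = field(default_factory=list)

    @cached_property
    def view(self) -> TemplateView:
        # cached_property writes to the instance __dict__ directly, so it
        # does not trip the frozen dataclass's __setattr__ guard.
        return TemplateView(self)


template = Template(
    r1=Rec("primary"),
    r1_auxiliaries=[
        Rec("sec", is_secondary=True),
        Rec("supp", is_supplementary=True),
    ],
)
secondary_names = [rec.name for rec in template.view.r1_secondaries()]
```

The view is built once per template and reused on later accesses, which keeps the dataclass namespace down to its four fields plus `view`.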

Additionally, I think we should make Template an iterable so you can do list(template) and for rec in template: ... and get back all of the alignments. Previously you had to do all_recs().
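A small sketch of what iteration could look like from the caller's side, following the R1-then-R2 order used in the mockup. `Rec` is again a hypothetical stand-in for `AlignedSegment`.

```python
from collections.abc import Iterable, Iterator
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Rec:
    """Hypothetical stand-in for pysam's AlignedSegment."""

    name: str


@dataclass(frozen=True)
class Template(Iterable[Rec]):
    r1: Optional[Rec] = None
    r2: Optional[Rec] = None
    r1_auxiliaries: List[Rec] = field(default_factory=list)
    r2_auxiliaries: List[Rec] = field(default_factory=list)

    def __iter__(self) -> Iterator[Rec]:
        # All R1 records first, then all R2 records, as in the mockup.
        if self.r1 is not None:
            yield self.r1
        yield from self.r1_auxiliaries
        if self.r2 is not None:
            yield self.r2
        yield from self.r2_auxiliaries


template = Template(
    r1=Rec("r1"),
    r2=Rec("r2"),
    r1_auxiliaries=[Rec("r1-supp")],
)
names = [rec.name for rec in template]
record_count = len(list(template))
```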

clintval avatar Feb 20 '25 22:02 clintval

+1 for making Template iterable over the constituent alignments.

I would suggest defining each iterator as a property instead of a method. (I find it more pythonic, though that may be subjective.) Similarly, I would vote against an intermediate view attribute, and in favor of the outlined API.
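For comparison, here is a sketch of the property form, with `Rec` as a hypothetical stand-in for `AlignedSegment`. One thing to weigh: each attribute access builds a fresh generator, while a generator saved to a variable is exhausted after one pass.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class Rec:
    """Hypothetical stand-in for pysam's AlignedSegment."""

    name: str


@dataclass(frozen=True)
class Template:
    r1: Optional[Rec] = None
    r1_auxiliaries: List[Rec] = field(default_factory=list)

    @property
    def all_r1(self) -> Iterator[Rec]:
        # The getter is a generator function, so every access to
        # `template.all_r1` returns a new generator object.
        if self.r1 is not None:
            yield self.r1
        yield from self.r1_auxiliaries


template = Template(r1=Rec("r1"), r1_auxiliaries=[Rec("aux")])

first_pass = [rec.name for rec in template.all_r1]
second_pass = [rec.name for rec in template.all_r1]  # fresh generator, same records

saved = template.all_r1
consumed = list(saved)
leftover = list(saved)  # the saved generator is already exhausted
```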

msto avatar Feb 21 '25 11:02 msto

I'm cautious that the proposed behavior could lead to some unexpected footguns.

Specifically, having r(1|2)_supplementals and r(1|2)_secondaries not return all alignments with the corresponding flag might be surprising to users.

I understand we're motivated by allowing users to collect all records associated with, say, a primary chimeric alignment. Perhaps we could consider more precise naming or additional iterator views?

e.g.

  • r1_supplementals -> All alignments marked supplementary
  • r1_primary_supplementals -> All alignments marked supplementary and not secondary
  • r1_secondary_supplementals -> All alignments marked supplementary and secondary
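Under this naming, the unqualified view is the union of the two qualified ones. The sketch below illustrates that relationship with free functions over a hypothetical `Rec` stand-in; none of these names are an existing fgpyo API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Rec:
    """Hypothetical stand-in for pysam's AlignedSegment."""

    name: str
    is_secondary: bool = False
    is_supplementary: bool = False


def supplementals(recs: Iterable[Rec]) -> List[Rec]:
    """All records flagged supplementary, regardless of the secondary flag."""
    return [rec for rec in recs if rec.is_supplementary]


def primary_supplementals(recs: Iterable[Rec]) -> List[Rec]:
    """Supplementary records that are not also flagged secondary."""
    return [rec for rec in recs if rec.is_supplementary and not rec.is_secondary]


def secondary_supplementals(recs: Iterable[Rec]) -> List[Rec]:
    """Supplementary records that are also flagged secondary."""
    return [rec for rec in recs if rec.is_supplementary and rec.is_secondary]


recs = [
    Rec("primary"),
    Rec("sec", is_secondary=True),
    Rec("supp", is_supplementary=True),
    Rec("sec_supp", is_secondary=True, is_supplementary=True),
]

all_names = [rec.name for rec in supplementals(recs)]
primary_names = [rec.name for rec in primary_supplementals(recs)]
secondary_names = [rec.name for rec in secondary_supplementals(recs)]
```

With these definitions a caller who filters on the supplementary flag alone gets what the name suggests, and the narrower views are opt-in.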

msto avatar Feb 21 '25 11:02 msto

@msto here's a second draft with some of your feedback incorporated.

Some of the method names are a lot longer but are also now much less implicit:

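from collections.abc import Iterable, Iterator
from dataclasses import dataclass, field

from pysam import AlignedSegment

@dataclass(frozen=True)
class Template(Iterable[AlignedSegment]):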
    r1: AlignedSegment | None = None
    r2: AlignedSegment | None = None
    r1_auxiliaries: list[AlignedSegment] = field(default_factory=list)
    r2_auxiliaries: list[AlignedSegment] = field(default_factory=list)

    def r1_all_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r1() if rec.is_supplementary)

    def r2_all_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r2() if rec.is_supplementary)

    def r1_primary_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r1() if rec.is_supplementary and not rec.is_secondary)

    def r2_primary_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r2() if rec.is_supplementary and not rec.is_secondary)

    def r1_secondary_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r1() if rec.is_supplementary and rec.is_secondary)

    def r2_secondary_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r2() if rec.is_supplementary and rec.is_secondary)

    def r1_primary_with_supplementals(self) -> Iterator[AlignedSegment]:
        yield from ([] if self.r1 is None else [self.r1])
        yield from self.r1_primary_supplementals()

    def r2_primary_with_supplementals(self) -> Iterator[AlignedSegment]:
        yield from ([] if self.r2 is None else [self.r2])
        yield from self.r2_primary_supplementals()

    def r1_secondary_with_supplementals(self) -> Iterator[AlignedSegment]:
        yield from self.r1_secondaries_without_supplementals()
        yield from self.r1_secondary_supplementals()

    def r2_secondary_with_supplementals(self) -> Iterator[AlignedSegment]:
        yield from self.r2_secondaries_without_supplementals()
        yield from self.r2_secondary_supplementals()

    def r1_secondaries_without_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r1() if rec.is_secondary and not rec.is_supplementary)

    def r2_secondaries_without_supplementals(self) -> Iterator[AlignedSegment]:
        yield from (rec for rec in self.all_r2() if rec.is_secondary and not rec.is_supplementary)

    def all_r1(self) -> Iterator[AlignedSegment]:
        yield from ([] if self.r1 is None else [self.r1])
        yield from self.r1_auxiliaries

    def all_r2(self) -> Iterator[AlignedSegment]:
        yield from ([] if self.r2 is None else [self.r2])
        yield from self.r2_auxiliaries

    def __iter__(self) -> Iterator[AlignedSegment]:
        yield from self.all_r1()
        yield from self.all_r2()

clintval avatar Feb 21 '25 18:02 clintval