fgpyo
fgpyo copied to clipboard
Add PairedFastqReader
Briefly discussed with @nh13 last week
In short, it'd be helpful to extend pysam's FastxFile/FastxRecord and have a TemplateIterator analogue for paired FASTQs, e.g.
class PairedFastxReader:
def __init__(
fq1_iter: Iterator[pysam.libcfaidx.FastxRecord],
fq2_iter: Iterator[pysam.libcfaidx.FastxRecord],
) -> Self:
pass
def __next__(self) -> PairedFastxRecord:
pass
@nh13 suggested we permit passing lists of files instead of limiting to a single file per read. I like this and think it might be appropriate as a classmethod rather than the primary initialization method? Other possible alternative initializations:
- [ ] from file path rather than from
Iteratoror open handle - [ ] from lists of file paths
- [ ] from directory name + sample prefix, globbing appropriate files
I can take this on, but I'd appreciate input on fleshing out features/requirements/naming convention before getting started @nh13 @tfenne
For Illumina, we could have up to four separate FASTQs: R1, R2, I1, and I2. In some cases, we have multiple files for R1. So really, rather than paired, I want the constructor to take in a List[List[Iterator[FastxRecord]]]. This could be a single param, or with *args (preferred?). So the outer list has typical length 2-4 (e.g. just R1 and R2, or just R1/R2/I1/I2), while the lengths of the inner list all must be the same, and the # records in each iterator are the same tool (this is strict I know, but in practice always the case).
I'd keep the __init__ method as general as you have it, and then class methods that allows us to have a Path instead of an iterator. I'd avoid baking in globbing in this library for now.
I would return a tuple in __next__. Any other info other than the list of records we need to return?
I think it might be clearer for the constructor to take a named argument for each list of iterators, rather than a list of lists.
e.g.
class GroupedFastxReader:
def __init__(
r1_iter: Iterator[pysam.libcfaidx.FastxRecord],
r2_iter: Iterator[pysam.libcfaidx.FastxRecord],
i1_iter: Optional[Iterator[pysam.libcfaidx.FastxRecord]],
i2_iter: Optional[Iterator[pysam.libcfaidx.FastxRecord]],
) -> Self:
pass
and provide class methods that allow initializing with lists of iterators (that are chained then passed to __init__()), Paths, and lists of Paths. Thoughts?
the lengths of the inner list all must be the same, and the # records in each iterator are the same tool (this is strict I know, but in practice always the case).
I agree these are both important to verify.
Any other info other than the list of records we need to return?
Perhaps the read name? So similarly to Template we might have something like:
class GroupedFastxRecord:
name: str
r1: pysam.libcfaidx.FastxRecord
r2: pysam.libcfaidx.FastxRecord
i1: Optional[pysam.libcfaidx.FastxRecord]
i2: Optional[pysam.libcfaidx.FastxRecord]
I like it, but would only make one modification that we can certainly sequence fragment reads and a single (or dual) barcodes. E.g. R1/I1, or R1/I1/I2. So I would make r2 optional, but have validation that you have at least two non-None iterators (and records). Thoughts? This argues for using a frozen attrs class for the grouped record.
One could consider using (validators in attrs)[https://www.attrs.org/en/stable/api-attr.html#module-attr.validators] for the grouped record, but in some cases it may be overly strict: https://github.com/fulcrumgenomics/fgpyo/blob/d2d50c46c22521f99823da843a1b6a3b80711117/fgpyo/sam/init.py#L716-L719
Yep, and also require that i1 is provided if i2 is provided.
After releasing 0.1.0 just now I thought I'd groom the backlog to see if I needed to close any issues out and just discovered this one @msto. I wish I had seen this sooner as I had a similar need for this functionality and implemented it as FastxZipped! We're still pre-1.0.0 so I would love your opinion on the API and if you think it needs any obvious enhancements and/or solves your use case.
Closing for now since we have FastxZipped in fgpyo which could evolve if we wanted it to behave differently. There is also an implementation outside of fgpyo which is specific to paired-end read FASTQ iteration (instead of arbitrary numbers of FASTQs):
- https://dnaio.readthedocs.io/en/latest/api.html#dnaio.TwoFilePairedEndReader