d4-format icon indicating copy to clipboard operation
d4-format copied to clipboard

Efficiently summing coverage tracks from multiple d4 files

Open percyfal opened this issue 8 months ago • 3 comments

Hi,

I'm running variant calling on non-model organisms, and for some of the downstream analyses (e.g., nucleotide diversity calculations), it is necessary to generate (possibly boolean) accessibility masks that classify sites as accessible for analysis at a single base-pair resolution. Accessibility masks can be generated by summing coverages over all samples and masking out sites with too low or too high coverage. In addition, one could mask sites based on the number/fraction of individuals having sufficient coverage, i.e., absence/presence calls (cf https://onlinelibrary.wiley.com/doi/full/10.1111/mec.16077, Table 3). The genomes in question are so large that it is not possible to generate variant files including monomorphic sites on which to perform filtering.

Until now I have been using the Python API to sum coverages and count the number of indivuduals with coverages within a threshold range for each site. This is somewhat slow so I was wondering whether this functionality could be added directly to the d4 Rust library. I gave it a try based on the merge function, but my Rust knowledge is somewhat limited.

I'm thinking of commands somewhere in the line of

d4tools sum file1.d4 file2.d4 ... fileN.d4 outfile.d4

and

d4tools count file1.d4 file2.d4 ... fileN.d4 outfile.d4 --min-coverage 3

I'd be happy to submit a pull request if I could get pointers on where to start. What are your thoughts on this - do you prefer cases like these to be handled by external APIs (e.g., Python) or is it amenable to implementation in Rust?

Cheers,

Per

percyfal avatar Jun 24 '24 09:06 percyfal