Genome accessibility/callability

Open mufernando opened this issue 1 year ago • 4 comments

It is important to consider genome accessibility when computing rates from genomic data.

scikit-allel has options to include an "accessibility mask", a boolean array indicating whether each base is accessible, which can be used to properly normalize quantities.

I found mentions of implementing this in #341.

I am happy to help make this happen, but since I am new to the codebase I'd need some hand-holding... Ideally we would need a way of reading BED files which can be attached to the genotype dataset. Then, when computing per-base statistics, we would need to intersect the accessible intervals with the window intervals to get the right denominator.
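To make the idea concrete, here is a minimal sketch of the intersection step in plain Python. It is not sgkit API: `parse_bed_lines` and `accessible_bases_per_window` are hypothetical helper names, and the sketch assumes a single contig with BED-style 0-based, half-open intervals for both the accessible regions and the windows.

```python
def parse_bed_lines(lines, contig):
    """Collect (start, end) intervals for one contig from BED-formatted lines.

    Hypothetical helper: only the first three BED columns are used.
    """
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] == contig:
            intervals.append((int(fields[1]), int(fields[2])))
    return intervals


def accessible_bases_per_window(intervals, windows):
    """Count accessible bases in each window (all intervals half-open)."""
    # Merge overlapping or adjacent intervals so no base is counted twice.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    counts = []
    for w_start, w_stop in windows:
        total = 0
        for start, end in merged:
            # Length of the overlap between the window and this interval.
            overlap = min(end, w_stop) - max(start, w_start)
            if overlap > 0:
                total += overlap
        counts.append(total)
    return counts


bed_lines = ["chr1\t0\t10", "chr1\t5\t20", "chr2\t0\t100", "chr1\t30\t40"]
intervals = parse_bed_lines(bed_lines, "chr1")
windows = [(0, 25), (25, 50), (50, 75)]
denominators = accessible_bases_per_window(intervals, windows)
```

The resulting counts would replace the window length as the denominator for per-base statistics. The nested loop is quadratic and only meant to show the logic; a real implementation would presumably use sorted intervals with `searchsorted` or a cumulative sum over a mask.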

mufernando avatar May 22 '24 22:05 mufernando

Sounds like adding a bed2zarr command to vcf2zarr would be a great starting point - fancy taking it on?

jeromekelleher avatar May 23 '24 07:05 jeromekelleher

This is something I'm also interested in and I have mentioned it to you @jeromekelleher in the context of our analyses on spruce. What you're suggesting is that we first add a bed2zarr command that translates BED coordinates into Zarr 0/1-encoded arrays whose length should equal the sum of the contig lengths in contig_length. The windowed statistics would then need to be adjusted by (a) excluding variant sites that are masked and (b) normalizing the windows by the number of accessible sites rather than the window length (see https://onlinelibrary.wiley.com/cms/asset/2fb89448-2f39-4bef-bff2-c1fac98e120c/men13571-fig-0001-m.jpg for an overview of the effects of missing data).
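A rough NumPy sketch of that encoding and the two adjustments might look like the following. Everything here is an assumption for illustration: `bed_to_mask` and `windowed_per_base` are hypothetical names, not existing bed2zarr or sgkit functions; the mask is held in memory rather than written to Zarr; and variant positions are taken to be 0-based coordinates in the concatenated genome (1-based VCF positions would need 1 subtracted and the contig offset added).

```python
import numpy as np


def bed_to_mask(intervals, contig_ids, contig_lengths):
    """Encode BED intervals as one 0/1 array spanning all contigs.

    intervals: iterable of (contig, start, end), 0-based and half-open.
    Returns the mask and a dict mapping each contig to its offset.
    """
    # Offset of each contig within the concatenated array.
    starts = np.concatenate([[0], np.cumsum(contig_lengths)[:-1]])
    offsets = {c: int(o) for c, o in zip(contig_ids, starts)}
    mask = np.zeros(int(np.sum(contig_lengths)), dtype=np.uint8)
    for contig, start, end in intervals:
        off = offsets[contig]
        mask[off + start : off + end] = 1
    return mask, offsets


def windowed_per_base(values, positions, mask, window_starts, window_stops):
    """Sum per-variant values in each window, divided by accessible bases."""
    values = np.asarray(values, dtype=float)
    positions = np.asarray(positions)
    # (a) Exclude variant sites that fall in masked (inaccessible) bases.
    keep = mask[positions] == 1
    values, positions = values[keep], positions[keep]

    out = np.full(len(window_starts), np.nan)
    for i, (start, stop) in enumerate(zip(window_starts, window_stops)):
        # (b) Normalize by accessible sites, not by the window length.
        n_accessible = int(mask[start:stop].sum())
        if n_accessible > 0:
            in_window = (positions >= start) & (positions < stop)
            out[i] = values[in_window].sum() / n_accessible
    return out


mask, offsets = bed_to_mask(
    [("chr1", 0, 4), ("chr2", 1, 3)], ["chr1", "chr2"], [10, 5]
)
stat = windowed_per_base(
    values=[0.5, 0.2, 0.3],
    positions=[1, 5, 11],
    mask=mask,
    window_starts=[0, 10],
    window_stops=[10, 15],
)
```

Windows with no accessible bases are left as NaN here rather than zero, so that "no data" is not confused with "no diversity"; whether that is the right convention is an open design choice.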

@mufernando have you started looking into this or should I have a go?

percyfal avatar Aug 29 '24 11:08 percyfal

+1 for this. @percyfal and I have discussed this, and I would be happy to contribute here to help get this going.

cademirch avatar Sep 10 '24 22:09 cademirch

I am not working on this right now!

mufernando avatar Sep 10 '24 22:09 mufernando