sgkit
sgkit copied to clipboard
Example popgen notebook
It would be good to have something that demonstrates a popgen workflow.
I converted one of @alimanfoo's MalariaGEN analyses to use sgkit a while back (#232). The problem with converting that to JupyterBook (#500) and including it as a part of the doc build is that the dataset is ~15GB in size, and the whole thing takes a while to run. So something smaller would be preferable - in the same way that #463 uses a cutdown dataset, for example.
More notebooks to bring over at https://github.com/pystatgen/sgkit/discussions/674#discussioncomment-1343388
I found ag.allsites.nonN.vcf.gz which is 718 MB. Perhaps we could grab a subset of the samples in this file and drive the MalariaGEN analyses from that? (Update: nope, turns out there are no samples in this file).
I found ag.allsites.nonN.vcf.gz which is 718 MB. Perhaps we could grab a subset of the samples in this file and drive the MalariaGEN analyses from that? (Update: nope, turns out there are no samples in this file).
That file has just the sites, no sample genotypes I'm afraid.
Did you want to start the examples from VCFs? The MalariaGEN Ag3 data release provides one VCF per sample, so you could build a dataset of any number of samples if you wanted, although there is a step required to merge the per-sample VCFs before zarr conversion. A bit more info about VCF downloads here.
Alternatively you could get an sgkit-style xarray dataset directly from the malariagen_data API, example here. By changing the sample_sets parameter you can select a smaller number of samples if you wanted. Data is all in GCS though, so even with a smaller number of samples it's probably not ideal if you want to run the notebook as part of the doc build.
Worth considering running the notebook manually, outside of the doc build?
Hey @alimanfoo! Indeed I discovered the lack of samples in that VCF file after a quick inspection.
I saw that you provide the data in Xarray format and serialized as Zarr, but I noticed some small discrepancies between your data model and ours, and thought working from VCF might be less confusing for sgkit users.
My current plan is to use your Xarray interface to select a subset of the data and massage it into our format, then save as Zarr and make that file available someplace. Is there a logical subset of the data that you think might make sense to use as a starting point for a demonstration notebook?
I saw that you provide the data in Xarray format and serialized as Zarr, but I noticed some small discrepancies between your data model and ours, and thought working from VCF might be less confusing for
sgkitusers.
No problem. Out of interest, were there any differences between the xarray dataset returned from our snp_calls() API method and the sgkit model? If so would be good to know, that method is intended to be a bridge to using sgkit on our data.
My current plan is to use your Xarray interface to select a subset of the data and massage it into our format, then save as Zarr and make that file available someplace. Is there a logical subset of the data that you think might make sense to use as a starting point for a demonstration notebook?
You could start with a single contig - 3L is probably good as smallest autosomal contig - and you could also choose a single sample set - e.g., AG1000G-BF-B might be a good one, it has 102 samples, with representation from all three mosquito species we sampled. There are also smaller sample sets if you want even slimmer, see here for what's in each of the sample sets.
you could build a dataset of any number of samples if you wanted, although there is a step required to merge the per-sample VCFs before Zarr conversion.
@alimanfoo May I ask you to clarify this point? Is it a sgkit function, or does a standard tool like BCFtools serve the purpose better?
Link to notebook: https://github.com/tomwhite/shiny-train/blob/sgkit/notebooks/gwss/sgkit_h12.ipynb