sgkit
Scalable genetics toolkit
On some PRs on https://app.codecov.io/gh/pystatgen/sgkit/pulls?page=1&state=open&order=-pullid it says "Missing base report"
At the moment we use `requirements.txt` to create the environment for development and usage. Although it's quick to get started with pip, I think it's a good idea to start...
```
================================== FAILURES ===================================
_________________ test_vcfzarr_to_zarr[None-True-True-False] __________________

shared_datadir = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/test_vcfzarr_to_zarr_None_True1/data')
tmpdir = local('C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\test_vcfzarr_to_zarr_None_True1')
grouped_by_contig = True, consolidated = True, has_variant_id = False
concat_algorithm = None

    @pytest.mark.parametrize(
        "grouped_by_contig, consolidated, has_variant_id",
        [...
```
Given the presence of wheels for all 3 of our upstream IO libraries, I think it makes sense to favor convenience now and have `pip install sgkit` pull in the...
- [VCF 4.2 spec](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
- Example VCF file: https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz
- [cyvcf2.pyx](https://github.com/brentp/cyvcf2/blob/master/cyvcf2/cyvcf2.pyx)
- Header types: 'CONTIG', 'FILTER', 'FORMAT', 'GENERIC', 'INFO'
- [vcf_reader.py](https://github.com/pystatgen/sgkit/blob/master/sgkit/io/vcf/vcf_reader.py)

### ##INFO

- These fields are (usually?) per variant...
Raising this issue to discuss the API for selecting data from a given genome region, which could be either a whole contig or a contiguous region within a contig. Breaking this...
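To make the region-selection question concrete, here is a minimal sketch of one possible approach, not sgkit's actual API. It assumes variants are sorted by contig index and then by position, and that `start` and `end` are inclusive coordinates. The function name `region_slice` and its parameters are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence


def region_slice(
    variant_contig: Sequence[int],
    variant_position: Sequence[int],
    contig: int,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> slice:
    """Return the slice of variant indexes that fall in a region.

    Hypothetical helper: assumes variants are sorted by contig index,
    then by position, and that start/end are inclusive.
    """
    # Locate the block of variants that belong to the requested contig.
    lo = bisect_left(variant_contig, contig)
    hi = bisect_right(variant_contig, contig)
    # Narrow to [start, end] within that block when bounds are given.
    if start is not None:
        lo = bisect_left(variant_position, start, lo, hi)
    if end is not None:
        hi = bisect_right(variant_position, end, lo, hi)
    return slice(lo, hi)


contigs = [0, 0, 0, 1, 1, 1, 1]
positions = [10, 20, 30, 5, 15, 25, 35]
whole_contig = region_slice(contigs, positions, contig=1)
sub_region = region_slice(contigs, positions, contig=1, start=10, end=25)
```

Returning a `slice` rather than a boolean mask means a whole contig and a sub-region are handled by the same code path, and the result can be used to index along the variants dimension without materializing a mask.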
It appears that this function does not scale well when run on a cluster. Notes from my most recent attempt:

- The code I ran is here: https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/blob/4f862e31b8093d25fdaa8da7f841b9be8583cda4/scripts/gwas.py#L268
- This...
#454 helped with GWAS performance, but as mentioned in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768411149, there is scope for further improvement since the transfer time is still a significant proportion of the compute time.
In #390 (and processing in general), using [preemptible instances](https://cloud.google.com/compute/docs/instances/preemptible) on GCP would bring a [cost saving of ~5x](https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/32#issuecomment-748934617).
Introduced in NumPy 1.20.0: https://numpy.org/doc/stable/release/1.20.0-notes.html#numpy-is-now-typed

These would replace our types in `sgkit.typing`.
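As a rough illustration of what adopting the NumPy types could look like, the sketch below annotates a small function with `numpy.typing.ArrayLike` and `numpy.typing.DTypeLike`, both available from NumPy 1.20.0. The function `to_array` is a made-up example and does not reflect what `sgkit.typing` currently defines.

```python
import numpy as np
from numpy.typing import ArrayLike, DTypeLike


def to_array(values: ArrayLike, dtype: DTypeLike = None) -> np.ndarray:
    """Convert any array-like input to an ndarray (hypothetical example)."""
    # ArrayLike covers lists, tuples, scalars, and existing arrays;
    # DTypeLike covers dtype objects, strings such as "int8", and None.
    return np.asarray(values, dtype=dtype)


calls = to_array([0, 1, 2], dtype="int8")
```

With annotations like these, a type checker such as mypy can validate call sites against NumPy's own definitions instead of locally maintained aliases.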