sgkit

Scalable genetics toolkit

Results 216 sgkit issues

For some PRs listed at https://app.codecov.io/gh/pystatgen/sgkit/pulls?page=1&state=open&order=-pullid Codecov reports "Missing base report".

At the moment we use `requirements.txt` to create the environment for both development and usage. Although it's quick to get started with pip, I think it's a good idea to start...

```
================================== FAILURES ===================================
_________________ test_vcfzarr_to_zarr[None-True-True-False] __________________

shared_datadir = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/test_vcfzarr_to_zarr_None_True1/data')
tmpdir = local('C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\test_vcfzarr_to_zarr_None_True1')
grouped_by_contig = True, consolidated = True, has_variant_id = False
concat_algorithm = None

    @pytest.mark.parametrize(
        "grouped_by_contig, consolidated, has_variant_id",
        [...
```

IO

Given the presence of wheels for all 3 of our upstream IO libraries, I think it makes sense to favor convenience now and have `pip install sgkit` pull in the...

- [VCF 4.2 spec](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
- Example VCF file: https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz
- [cyvcf2.pyx](https://github.com/brentp/cyvcf2/blob/master/cyvcf2/cyvcf2.pyx)
- Header types: 'CONTIG', 'FILTER', 'FORMAT', 'GENERIC', 'INFO'
- [vcf_reader.py](https://github.com/pystatgen/sgkit/blob/master/sgkit/io/vcf/vcf_reader.py)

### ##INFO

- These fields are (usually?) per variant...

IO
data representation

Raising this issue to discuss the API for selecting data from a given genome region, which could be either a whole contig or a contiguous region within a contig. Breaking this...
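
To make the two cases concrete (whole contig versus a contiguous region within a contig), here is a minimal sketch of one possible selection helper. It is not sgkit's API: the function name `region_slice`, the argument names, and the conventions (variants sorted by contig index and then position, inclusive `start` and `end`) are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence


def region_slice(
    variant_contig: Sequence[int],
    variant_position: Sequence[int],
    contig: int,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> slice:
    """Hypothetical helper: index range of variants in a genome region.

    Assumes variants are sorted by (contig index, position) and that
    `start` and `end` are inclusive positions. Omitting both selects
    the whole contig.
    """
    keys = list(zip(variant_contig, variant_position))
    # Infinite bounds stand in for "from the first / to the last variant".
    lo_pos = start if start is not None else float("-inf")
    hi_pos = end if end is not None else float("inf")
    lo = bisect_left(keys, (contig, lo_pos))
    hi = bisect_right(keys, (contig, hi_pos))
    return slice(lo, hi)


# Toy data: three variants on contig 0, two on contig 1.
contigs = [0, 0, 0, 1, 1]
positions = [100, 200, 300, 50, 150]

whole_contig = region_slice(contigs, positions, contig=1)
sub_region = region_slice(contigs, positions, contig=0, start=150, end=300)
```

The returned `slice` could then be applied along the variants dimension of any array that shares this ordering; whether a real API should return a slice, a mask, or a new dataset is exactly the kind of question the issue raises.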

data representation

It appears that this function does not scale well when run on a cluster. Notes from my most recent attempt:

- The code I ran is here: https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/blob/4f862e31b8093d25fdaa8da7f841b9be8583cda4/scripts/gwas.py#L268
- This...

performance

#454 helped with GWAS performance, but as mentioned in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768411149, there is scope for further improvement since the transfer time is still a significant proportion of the compute time.

performance

In #390 (and processing in general), using [preemptible instances](https://cloud.google.com/compute/docs/instances/preemptible) on GCP would bring a [cost saving of ~5x](https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/32#issuecomment-748934617).

performance

NumPy type annotations were introduced in NumPy 1.20.0 (https://numpy.org/doc/stable/release/1.20.0-notes.html#numpy-is-now-typed). These would replace our types in `sgkit.typing`.
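
As a rough illustration of what the upstream annotations look like, the sketch below uses `numpy.typing.ArrayLike` and `numpy.typing.DTypeLike`, both available from NumPy 1.20.0. The helper `as_float_array` is hypothetical and does not come from sgkit; it only shows how a signature might be annotated with the NumPy-provided types.

```python
import numpy as np
from numpy.typing import ArrayLike, DTypeLike  # requires NumPy >= 1.20.0


def as_float_array(values: ArrayLike, dtype: DTypeLike = np.float64) -> np.ndarray:
    """Hypothetical helper: coerce any array-like input to an ndarray."""
    return np.asarray(values, dtype=dtype)


# Lists, tuples, scalars and existing arrays all satisfy ArrayLike.
calls = as_float_array([0, 1, 2])
```

These annotations only affect static type checking (for example with mypy); they do not change runtime behaviour.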

process + tools