Results 132 issues of Tom White

Alleles are a challenge to represent efficiently in fixed-length arrays. There are a couple of problems: 1. the number of alleles is not known until the whole VCF file has...

data representation
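To make the allele-representation problem above concrete, here is a minimal sketch of padding ragged per-variant allele lists into a fixed-width array. The function name `pad_alleles`, the `max_alleles` parameter, and the empty-string fill value are all hypothetical choices for illustration, not sgkit's API. Note that the width can only be chosen safely once every variant has been seen, which is the first problem the issue describes.

```python
import numpy as np


def pad_alleles(alleles_per_variant, max_alleles=None, fill=""):
    """Pack ragged per-variant allele lists into a (variants, alleles) array.

    Hypothetical helper: `fill` marks unused allele slots.
    """
    observed_max = max((len(a) for a in alleles_per_variant), default=0)
    if max_alleles is None:
        # Width is only known after scanning every variant.
        max_alleles = observed_max
    elif observed_max > max_alleles:
        raise ValueError(
            f"variant with {observed_max} alleles exceeds max_alleles={max_alleles}"
        )
    out = np.full((len(alleles_per_variant), max_alleles), fill, dtype=object)
    for row, alleles in enumerate(alleles_per_variant):
        out[row, : len(alleles)] = alleles
    return out


variants = [["A", "T"], ["G", "C", "GT"], ["T"]]
padded = pad_alleles(variants)
```

Fixing `max_alleles` up front avoids the full scan, at the cost of either wasted space or an error when a later variant has more alleles than expected.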

See https://github.com/pystatgen/sgkit/pull/303#discussion_r507906940

Add a check like the one mentioned in https://github.com/pystatgen/sgkit/pull/573#issue-646597266 to the docs GitHub workflow.

documentation

The `maximal_independent_set` algorithm currently pulls the whole LD matrix into memory. We could improve this by partitioning by contig, as suggested in https://github.com/pystatgen/sgkit/pull/561#pullrequestreview-659092266.
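The partitioning idea can be sketched as follows, assuming that no LD pair spans two contigs, so each contig's edges form an independent subproblem. This is a simplified greedy selection written for illustration, not sgkit's implementation; `greedy_drop`, `drop_by_contig`, and `variant_contig` are hypothetical names. In a real pipeline each contig's edges would be loaded separately rather than grouped from one in-memory list.

```python
from collections import defaultdict


def greedy_drop(edges):
    """Return the nodes to drop so that no edge joins two kept nodes.

    Edges are (i, j) index pairs; for each edge whose endpoints are both
    still kept, the higher-indexed endpoint is dropped.
    """
    dropped = set()
    for i, j in sorted((min(e), max(e)) for e in edges):
        if i not in dropped and j not in dropped:
            dropped.add(j)
    return dropped


def drop_by_contig(edges, variant_contig):
    """Apply greedy_drop one contig at a time.

    `variant_contig[i]` is the contig of variant i (assumed layout).
    """
    by_contig = defaultdict(list)
    for i, j in edges:
        if variant_contig[i] != variant_contig[j]:
            raise ValueError(f"edge ({i}, {j}) spans two contigs")
        by_contig[variant_contig[i]].append((i, j))
    dropped = set()
    for contig_edges in by_contig.values():
        # Only one contig's edges are processed at a time.
        dropped |= greedy_drop(contig_edges)
    return dropped


edges = [(0, 1), (1, 2), (3, 4)]
variant_contig = [0, 0, 0, 1, 1]
whole = greedy_drop(edges)
partitioned = drop_by_contig(edges, variant_contig)
```

Because no edge crosses a contig boundary in this example, the per-contig result matches the whole-graph result.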

Some of the PRs listed at https://app.codecov.io/gh/pystatgen/sgkit/pulls?page=1&state=open&order=-pullid show "Missing base report".

```
================================== FAILURES ===================================
_________________ test_vcfzarr_to_zarr[None-True-True-False] __________________

shared_datadir = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/test_vcfzarr_to_zarr_None_True1/data')
tmpdir = local('C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\test_vcfzarr_to_zarr_None_True1')
grouped_by_contig = True, consolidated = True, has_variant_id = False
concat_algorithm = None

    @pytest.mark.parametrize(
        "grouped_by_contig, consolidated, has_variant_id",
        [...
```

IO

#454 helped with GWAS performance, but as mentioned in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768411149, there is scope for further improvement since the transfer time is still a significant proportion of the compute time.

performance

In #390 (and processing in general), using [preemptible instances](https://cloud.google.com/compute/docs/instances/preemptible) on GCP would bring a [cost saving of ~5x](https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/32#issuecomment-748934617).

performance

NumPy type annotations were introduced in NumPy 1.20.0: https://numpy.org/doc/stable/release/1.20.0-notes.html#numpy-is-now-typed. These would replace our types in `sgkit.typing`.

process + tools
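As a rough illustration of the NumPy typing issue above, the sketch below annotates a function with `ArrayLike` and `DTypeLike` from `numpy.typing`, both available from NumPy 1.20. The function `as_float_array` is hypothetical, and which aliases in `sgkit.typing` these would replace is an assumption, not something stated in the issue.

```python
import numpy as np
from numpy.typing import ArrayLike, DTypeLike  # requires NumPy >= 1.20


def as_float_array(values: ArrayLike, dtype: DTypeLike = np.float64) -> np.ndarray:
    """Convert any array-like input to an ndarray of the requested dtype."""
    return np.asarray(values, dtype=dtype)


result = as_float_array([1, 2, 3])
```

A type checker such as mypy can then verify callers against NumPy's own definitions instead of locally maintained aliases.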