sgkit
simulate_genotype_call_dataset creates duplicate alleles
E.g. a single site can end up with the same allele listed twice in ds['variant_allele']:
import sgkit as sg
import numpy as np
ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4, missing_pct=0, phased=True, seed=1)
for i, alleles in enumerate(ds['variant_allele'].values):
    print(f"Site {i}: {alleles}")
    assert len(np.unique(alleles)) == len(alleles)
Fails on site 6:
Site 6: [b'T' b'T']
---------------------------------------------------------------------------
AssertionError
This can cause much confusion in downstream analysis. See https://github.com/tskit-dev/tsinfer/issues/927
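
To list every affected site instead of stopping at the first failed assertion, a small helper along these lines might be useful. This is only a sketch: find_duplicate_allele_sites is a hypothetical name and not part of sgkit, and it assumes each row is a sequence of hashable allele values such as bytes or str.

```python
def find_duplicate_allele_sites(variant_alleles):
    """Return the indices of sites whose allele list repeats a value.

    Hypothetical helper, not an sgkit function. `variant_alleles` is any
    iterable of per-site allele sequences.
    """
    bad_sites = []
    for i, alleles in enumerate(variant_alleles):
        alleles = list(alleles)
        # A set drops repeats, so a shorter set means a duplicated allele.
        if len(set(alleles)) != len(alleles):
            bad_sites.append(i)
    return bad_sites


# Small hand-written input standing in for real simulated data.
example = [[b"A", b"C"], [b"T", b"T"], [b"G", b"A"]]
print(find_duplicate_allele_sites(example))
```

The rows of ds['variant_allele'].values from the reproduction above should be usable as input in the same way, since the helper only iterates over each row.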