read_plink returns bytes for variant_allele, not unicode

Open jeromekelleher opened this issue 1 year ago • 0 comments

I don't think there's any good reason to return bytes rather than UTF-8 unicode strings: it can only lead to bugs in user code and inconsistencies in string handling (anyone remember Python 2???)

This is based on the "example" plink dataset in the test suite

    import sgkit.io.plink
    # `path` points at the "example" plink dataset in the test suite
    sg_ds = sgkit.io.plink.read_plink(path=path)
    print(sg_ds.variant_allele.values)
    print(sg_ds.variant_allele)

Gives

    [[b'A' b'G']
     [b'T' b'C']]
    <xarray.DataArray 'variant_allele' (variants: 2, alleles: 2)>
    dask.array<astype, shape=(2, 2), dtype=|S1, chunksize=(2, 1), chunktype=numpy.ndarray>
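
Until this changes, one possible user-side workaround is to decode the array after loading. The following is only a minimal sketch: it uses a plain NumPy array as a stand-in for `sg_ds.variant_allele.values` (same values and `|S1` dtype as the output above), and the variable names are illustrative. Whether the same calls behave identically on the dask-backed `variant_allele` variable is not tested here.

```python
import numpy as np

# Stand-in for sg_ds.variant_allele.values as printed above (dtype |S1).
alleles = np.array([[b"A", b"G"], [b"T", b"C"]], dtype="S1")

# astype(str) converts fixed-width bytes to a unicode dtype.
# This only works for ASCII content, which is fine for nucleotide codes.
decoded = alleles.astype(str)

# np.char.decode makes the encoding explicit.
decoded_utf8 = np.char.decode(alleles, "utf-8")

print(decoded.dtype.kind)  # "U" means a unicode string dtype
print(decoded)
```

Comparisons such as `decoded == "A"` then work with ordinary Python strings, which is the behaviour the issue argues `read_plink` should provide directly.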

jeromekelleher • Mar 06 '24 14:03