sgkit

Don't display attributes expanded for dataset

Open jeromekelleher opened this issue 2 years ago • 5 comments

When you look at a dataset derived from VCF in a notebook, you get this:

[Screenshots: "2023-12-14 13-00-11" and "2023-12-14 12-59-49"]

The attributes are automatically "open", and this means that the VCF header attribute (which will be several megabytes for large datasets) dominates.

I'm not sure this is something we can influence, but can we either truncate the VCF header attribute for display, or tweak the display of the dataset somehow to at least keep the attributes "closed" by default?

jeromekelleher, Dec 14 '23 13:12

Alternatively we could discard the "#CHROM ..." line of the VCF header, since we can reproduce it using the sample_id variable. Also, it's wrong when we do a subset operation.
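
To make the idea concrete, here is a minimal sketch of dropping the "#CHROM ..." line from a stored header string and regenerating it from a list of sample IDs. It is not sgkit's implementation: the helper names `strip_column_line` and `column_line` are hypothetical, and the example header is made up. The fixed column names follow the VCF specification, with `FORMAT` and the sample columns added only when there are samples.

```python
# Hypothetical helpers for illustration only; these are not part of sgkit.
VCF_FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def strip_column_line(header: str) -> str:
    """Return the header without its '#CHROM ...' line, keeping the '##' meta lines."""
    kept = [line for line in header.splitlines() if not line.startswith("#CHROM")]
    return "".join(line + "\n" for line in kept)


def column_line(sample_ids) -> str:
    """Build the '#CHROM ...' line for the given sample IDs."""
    columns = list(VCF_FIXED_COLUMNS)
    sample_ids = list(sample_ids)
    if sample_ids:
        # FORMAT and per-sample columns are only present with genotype data.
        columns.append("FORMAT")
        columns.extend(sample_ids)
    return "\t".join(columns) + "\n"


# Made-up header with three samples.
header = (
    "##fileformat=VCFv4.3\n"
    "##contig=<ID=1>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB\tC\n"
)

meta = strip_column_line(header)

# After subsetting to two samples, the stored column line would be stale,
# but a regenerated one matches the current sample IDs.
subset_ids = ["A", "C"]
rebuilt = meta + column_line(subset_ids)
```

In a real dataset the sample IDs would come from the `sample_id` variable, so the regenerated line stays correct after a subset operation.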

jeromekelleher, Dec 14 '23 13:12

It can be controlled with an xarray setting: https://docs.xarray.dev/en/stable/generated/xarray.set_options.html#xarray-set-options
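
For reference, a minimal sketch of that setting, assuming the relevant option is `display_expand_attrs` from `xarray.set_options`. The dataset here is a made-up stand-in with a large `vcf_header` attribute, not a real sgkit dataset. The example uses the text repr so it runs outside a notebook; the same option also applies to the notebook display.

```python
import numpy as np
import xarray as xr

# Made-up stand-in for a VCF-derived dataset with a large header attribute.
ds = xr.Dataset(
    {"call_genotype": (("variants", "samples"), np.zeros((2, 3), dtype="i1"))},
    attrs={"vcf_header": "##fileformat=VCFv4.3\n" * 1000},
)

# Collapse the attributes section for anything displayed inside this block.
with xr.set_options(display_expand_attrs=False):
    collapsed_text = repr(ds)

# Explicitly expanded, for comparison.
with xr.set_options(display_expand_attrs=True):
    expanded_text = repr(ds)
```

Calling `xr.set_options(display_expand_attrs=False)` without the `with` block applies the setting for the rest of the session. Whether sgkit should set this on behalf of users is a separate design question.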

timothymillar, Dec 14 '23 17:12

This originally came up here: https://github.com/pystatgen/sgkit/issues/463#issuecomment-827445369

tomwhite, Dec 15 '23 09:12

As a quick aside @tomwhite, do we ever use the "#CHROM POS.." line from the VCF header? If not, I think we should discard it, as there's no real information there (I'll open an issue).

jeromekelleher, Dec 15 '23 09:12

@jeromekelleher we used to use the "#CHROM POS.." line to support round-tripping of VCF -> Zarr -> VCF, but we can generate the header now, so it may not be necessary to store it. See https://github.com/pystatgen/sgkit/blob/2ab47b587768bed166d3c477694bed06250123c9/sgkit/io/vcf/vcf_writer.py#L412-L559

tomwhite, Dec 18 '23 15:12