sgkit
Support mixed ploidy in display_genotypes
Currently a lower-ploidy sample appears to have missing alleles, i.e. -2 is treated as -1.
The calls [[0, 0, 1, 1], [0, 1, -2, -2], [0, 0, 1, -1]] would ideally be displayed as 0/0/1/1 0/1 0/0/1/. in a mixed-ploidy dataset.
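A minimal sketch of the requested behaviour, in plain Python rather than sgkit's actual implementation. The helper name `format_call_vcf` and the constants are hypothetical; the only assumption taken from this thread is that -1 marks a missing allele and -2 marks padding for a lower-ploidy call.

```python
MISSING = -1      # assumed: missing allele, shown as "." in VCF
NON_ALLELE = -2   # assumed: padding that fills out a lower-ploidy call


def format_call_vcf(call, phased=False):
    """Render one genotype call VCF-style, dropping non-allele padding.

    Hypothetical helper for illustration; not part of the sgkit API.
    """
    sep = "|" if phased else "/"
    # Padding is removed entirely so a diploid call stays diploid.
    alleles = [a for a in call if a != NON_ALLELE]
    return sep.join("." if a == MISSING else str(a) for a in alleles)


calls = [[0, 0, 1, 1], [0, 1, -2, -2], [0, 0, 1, -1]]
print(" ".join(format_call_vcf(c) for c in calls))
# prints: 0/0/1/1 0/1 0/0/1/.
```

Dropping the padding keeps the output close to what a VCF would contain, at the cost of losing the position of each -2 within the call.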
It's not obvious to me that the VCF encoding is the right approach here - e.g., people might read your example above as implying the calls are unphased when this is orthogonal. Isn't it better to be explicit about what our data encoding is? (I agree it's not obvious immediately what -2 and -1 mean, though)
implying the calls are unphased
I'm not quite following this. Do you mean that within a single sample in which copy number is variable across a single chromosome, the -2 values of a call could be part of the phasing? I agree that we could be a bit more explicit than the VCF encoding.
The issue I'm facing is that mixed-ploidy/CNV genotype calls in a VCF that are encoded as:
0/0/1/1 0/1 0/1/./.
are being displayed ambiguously by sg.display_genotypes as:
0/0/1/1 0/1/./. 0/1/./.
A more explicit option for sg.display_genotypes could be:
0/0/1/1 0/1/_/_ 0/1/./.
This is more faithful to sgkit's encoding and can handle out-of-order calls, which may be necessary for some phasings, e.g.:
0|2|1|.
0|0|1|1
_|_|0|0
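The explicit variant could be sketched as follows. This is an illustration only: `format_call_explicit` is a hypothetical name, and the choice of `_` for -2 simply follows the proposal above. Unlike a VCF-style rendering that drops padding, every position in the call is kept, so padding that appears before real alleles survives.

```python
def format_call_explicit(call, phased=False):
    """Render one call, keeping every position of the underlying array.

    Hypothetical helper for illustration; not part of the sgkit API.
    -1 (missing) is shown as "." and -2 (non-allele padding) as "_".
    """
    sep = "|" if phased else "/"
    symbols = {-1: ".", -2: "_"}
    # Any value not in the table is an ordinary allele index.
    return sep.join(symbols.get(a, str(a)) for a in call)


phased_calls = [[0, 2, 1, -1], [0, 0, 1, 1], [-2, -2, 0, 0]]
for call in phased_calls:
    print(format_call_explicit(call, phased=True))
# prints:
# 0|2|1|.
# 0|0|1|1
# _|_|0|0
```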
Sorry, I was being unclear @timothymillar - I just meant that people might think there's phase information encoded in there as well if we're following the VCF way of doing things.
Maybe we could use some unicode box drawing characters to help display things more effectively? This is a "human only" encoding, so maybe we could do something a bit nicer if we're not restricted to ASCII?
do something a bit nicer if we're not restricted to ASCII
That could be great for users who aren't familiar with VCF. I think there will still be some use cases for (or users who prefer) displaying genotypes "like a VCF".
Maybe sg.display_genotypes could have a format argument with options in {"pretty", "VCF"}, or is that needlessly complicated?
That's a good idea I think - we can try to emulate VCF for those that are familiar with it, but default to something more human-readable without the legacy baggage. (Or do you think we should default to VCF?)
Defaulting to the most human readable option makes sense. I'm interested to see what that option would look like!
@jeromekelleher I'll close this via #1030 (just VCF representation). Do you want to open another issue for "pretty" formatting?