Support mixed ploidy in display_genotypes

Open timothymillar opened this issue 4 years ago • 6 comments

Currently a lower-ploidy sample appears to have missing alleles, i.e. -2 is treated as -1. The calls [[0, 0, 1, 1], [0, 1, -2, -2], [0, 0, 1, -1]] would ideally be displayed as 0/0/1/1 0/1 0/0/1/. in a mixed-ploidy dataset.
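
A minimal sketch of the rendering described above, assuming -1 marks a missing allele and -2 is the fill value that pads a lower-ploidy call. `format_call_vcf` is a hypothetical helper written for illustration, not part of sgkit:

```python
def format_call_vcf(call, phased=False):
    """Render one genotype call VCF-style, dropping -2 fill values."""
    sep = "|" if phased else "/"
    # -2 only pads lower-ploidy calls, so it is omitted entirely;
    # -1 is a genuinely missing allele and is shown as ".".
    alleles = ["." if a == -1 else str(a) for a in call if a != -2]
    return sep.join(alleles)


calls = [[0, 0, 1, 1], [0, 1, -2, -2], [0, 0, 1, -1]]
rendered = [format_call_vcf(c) for c in calls]
print(" ".join(rendered))  # prints: 0/0/1/1 0/1 0/0/1/.
```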

timothymillar avatar May 18 '21 02:05 timothymillar

It's not obvious to me that the VCF encoding is the right approach here - e.g., people might read your example above as implying the calls are unphased when this is orthogonal. Isn't it better to be explicit about what our data encoding is? (I agree it's not obvious immediately what -2 and -1 mean, though)

jeromekelleher avatar Jun 01 '21 18:06 jeromekelleher

implying the calls are unphased

I'm not quite following this. Do you mean that within a single sample in which copy number is variable across a single chromosome, the -2 values of a call could be part of the phasing? I agree that we could be a bit more explicit than the VCF encoding.

The issue I'm facing is that mixed-ploidy/CNV genotype calls in a VCF that are encoded as:

0/0/1/1    0/1        0/1/./.

are being displayed ambiguously by sg.display_genotypes as:

0/0/1/1    0/1/./.    0/1/./.

A more explicit option for sg.display_genotypes could be:

0/0/1/1    0/1/_/_    0/1/./.

This is more faithful to sgkit's encoding and can handle out-of-order calls, which may be necessary for some phasings, e.g.:

0|2|1|.
0|0|1|1
_|_|0|0
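
A rough sketch of this explicit option, assuming -1 maps to "." and -2 maps to "_", with every position kept so that out-of-order fill values survive. `format_call_explicit` is a hypothetical name, not an sgkit function:

```python
def format_call_explicit(call, phased=False):
    """Render one call, keeping every position (including -2 fill values)."""
    sep = "|" if phased else "/"
    # Assumed symbols: "." for a missing allele, "_" for a non-allele.
    symbols = {-1: ".", -2: "_"}
    return sep.join(symbols.get(a, str(a)) for a in call)


phased_calls = [[0, 2, 1, -1], [0, 0, 1, 1], [-2, -2, 0, 0]]
explicit = [format_call_explicit(c, phased=True) for c in phased_calls]
print("\n".join(explicit))
```

Because no position is dropped, a call such as [-2, -2, 0, 0] keeps its alleles in the third and fourth slots instead of collapsing to a diploid 0|0.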

timothymillar avatar Jun 01 '21 23:06 timothymillar

Sorry I was being unclear @timothymillar - I just meant that people might think there's phase information encoded in there as well if we're following the VCF way of doing things.

Maybe we could use some Unicode box-drawing characters to help display things more effectively? This is a "human only" encoding, so maybe we could do something a bit nicer if we're not restricted to ASCII?

jeromekelleher avatar Jun 02 '21 15:06 jeromekelleher

do something a bit nicer if we're not restricted to ASCII

That could be great for users who aren't familiar with VCF. I think there will still be some use cases for (or users who prefer) displaying genotypes "like a VCF".

Maybe sg.display_genotypes could have a format argument with options in {"pretty", "VCF"}, or is that needlessly complicated?
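
To make the idea concrete, here is a small sketch of how such an argument might dispatch. `display_call` and its `format` parameter are hypothetical and do not reflect sgkit's actual signature; the "pretty" branch is only a placeholder, since what that layout should look like is still undecided:

```python
def display_call(call, format="pretty"):
    """Hypothetical dispatcher over display formats for one genotype call."""
    if format == "VCF":
        # VCF-like: omit -2 fill values, show -1 as ".".
        return "/".join("." if a == -1 else str(a) for a in call if a != -2)
    if format == "pretty":
        # Placeholder only: keep every position and mark -2 as "_".
        symbols = {-1: ".", -2: "_"}
        return "/".join(symbols.get(a, str(a)) for a in call)
    raise ValueError(f"unknown format: {format!r}")


print(display_call([0, 1, -2, -2], format="VCF"))
print(display_call([0, 1, -2, -2], format="pretty"))
```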

timothymillar avatar Jun 02 '21 22:06 timothymillar

That's a good idea I think - we can try to emulate VCF for those that are familiar with it, but default to something more human-readable without the legacy baggage. (Or do you think we should default to VCF?)

jeromekelleher avatar Jun 03 '21 16:06 jeromekelleher

Defaulting to the most human-readable option makes sense. I'm interested to see what that option would look like!

timothymillar avatar Jun 03 '21 21:06 timothymillar

@jeromekelleher I'll close this via #1030 (just VCF representation). Do you want to open another issue for "pretty" formatting?

timothymillar avatar Feb 27 '23 01:02 timothymillar