sgkit
                                
                                 sgkit copied to clipboard
                                
                                    sgkit copied to clipboard
                            
                            
                            
                        Unused alleles being `""` is confusing for downstream tools - consider `None`?
tsinfer needs to check the number of alleles at each variant. Currently, the only safe way to do this for a VCF-derived sgkit dataset is to iterate over the entire genotype array and check if each allele is actually referred to, less than ideal for very large datasets. This is because unused alleles are given the value "" which could represent a deletion. As variant_allele is an object array, maybe the unused alleles could be set to None?
Also useful to think about how we can get the number of alleles per site - filtering out biallelic is something basic people will want to do.
This seems like a good idea. Best thing is probably to try it out and see if it triggers any failure in the unit tests.
Great, I'm happy to have a go at this.
Temporarily putting this into the 0.6.0 release in case it can be changed now without breaking people's code
I just realized that the "" usage is from the spec, https://github.com/pystatgen/vcf-zarr-spec/blob/main/vcf_zarr_spec.md#missing-and-fill-values, so may be harder to change (and also was in the 0.5.0 release I think).
@benjeffery it would be useful to have a minimal reproducible example/failing test to show what the problem is.
Dropping this from 0.6.0 then
The problem I initially faced was that I converted a biallelic VCF using the default options (which gives 4 alleles), then ran tsinfer.  tsinfer filters for biallelic so no sites were used for inference. So I then modified tsinfer to ignore alleles that are "" when counting. In doing this I realised that "" could be a deletion, and there is no way to know if that is the case without checking all the genotypes.
Sorry it's not more minimal/simple than that.
I've just come across this issue when trying to calculate the number of unique alleles at each site. I don't actually think this should be an issue if we are following the VCF spec, because all variants are required to be padded to include at least one base:
- REF — reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event; this padding base is not required (although it is permitted) for e.g. complex substitutions or other events where all alleles have at least one base represented in their Strings. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String “<ID>”) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. Tools processing VCF files are not required to preserve case in the allele Strings. (String, Required). If the reference sequence contains IUPAC ambiguity codes not allowed by this specification (such as R = A/G), the ambiguous reference base must be reduced to a concrete base by using the one that is first alphabetically (thus R as a reference base is converted to A in VCF.)
- ALT — alternate base(s): Comma-separated list of alternate non-reference alleles. These alleles do not have to be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N (case insensitive) or the ‘*’ symbol (allele missing due to overlapping deletion) or a MISSING value ‘.’ (no variant) or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in Section 5.4. If there are no alternative alleles, then the MISSING value must be used. In other words, the ALT field must be a symbolic allele, or a breakend replacement string, or match the regular expression ^([ACGTNacgtn]+|*|.)$. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
TLDR: the VCF spec does not allow "" to indicate a deletion because a variable locus should always be extended so that all REF and ALT alleles contain at least 1 base. So the current vcf-zarr-spec seems to be consistent with the VCF spec.
For the sake of clarity, We could also include a variant_allele_fill mask value to indicate fill values (""). I also was considering adding a derived variant_allele_unique value to record the number of unique alleles at each locus.
Thanks for that @timothymillar, looks like I'm doing the right thing in tsinfer then. I agree that a variant_allele_fill mask would make this unambiguous though.
I noticed that the dtype for variant_allele can be bytes or objects ('O'). This makes it unclear what a generic 'non-allele' value could be. Is there a particular use-case for an object array? I would like to have a generic function to count the number of unique alleles at each variant.
Good question - maybe it's something to do with ragged multi-character alleles (e.g. indels)?
That makes sense, I'd guess that checking for null values in an object array should include checking for any of b'', '', None to be safe. But I think we should still encourage the use of b'' for null values for consistency (if it comes up).