tsinfer icon indicating copy to clipboard operation
tsinfer copied to clipboard

VariantData sequence_length not being taken from zarr `contig_length` field

Open hyanwong opened this issue 1 year ago • 7 comments

I'm having problems getting VariantData.sequence_length to match the value in the contig_length field of the zarr file. It wasn't clear until reading the code that I needed to set z.attrs["sequence_length"]. If there isn't a z.attrs["sequence_length"] defined, should we take the value from contig_length instead?

I was having the problem when trying to round-trip data from a tree sequence via a vcf file into another inferred tree sequence, and the sequence lengths didn't match up.

hyanwong avatar Aug 14 '24 01:08 hyanwong

To use contig_length we would need to know which value from that array to use. There's the usual issue here that sgkit has many contigs, whereas tsinfer only allows one. One option would be to error out if there is only one contig - but that seems overly restrictive.

benjeffery avatar Aug 15 '24 09:08 benjeffery

Don't we need to error out if the variants cover multiple contigs? I guess only if the sites mask leaves more than one contig available. Presumably we check this when running tsinfer on the VariantData object?

That implies (to me) that the sequence length is essentially dependent on the mask used, right? Following on from that thought, the sequence length can be picked from the contig_length field (and cached) after checking the uniqueness of the (masked) CHROM field, I suppose? Sounds a bit complex though!

hyanwong avatar Aug 15 '24 09:08 hyanwong

If sequence_length isn't present we currently default to last_position+1. We don't currently error on differing chroms - but we do error if positions aren't monotonically increasing.

benjeffery avatar Aug 16 '24 11:08 benjeffery

If there is more than one contig present in the VCF Zarr we require a contig parameter to be provided in the constructor. If a contig_length is present, we use this as sequence_length if not, we do the last_position + 1 thing.

I don't think we should be requiring any custom Zarr attrs, as we won't have write access in many cases.

jeromekelleher avatar Aug 27 '24 10:08 jeromekelleher

If there is more than one contig present in the VCF Zarr we require a contig parameter to be provided in the constructor. If a contig_length is present, we use this as sequence_length if not, we do the last_position + 1 thing.

Sorry, I'm not getting this. At the moment users can quite happily make a .vcz file with multiple defined contigs, and feed that to tsinfer. It then only barfs because positions aren't monotonically increasing. I just wrote some extra docs to show how to mask out all but one contig.

Are you saying that we should detect multiple contigs when making a VariantData object, and require one of them to be specified? If so, then I guess it would be also reasonable to get the equivalent contig_length from the .vcz file (if it exists), which would be handy.

I don't think we should be requiring any custom Zarr attrs, as we won't have write access in many cases.

Yes, that seems the right approach.

hyanwong avatar Sep 04 '24 10:09 hyanwong

Yeah, it seems that if there are multiple contigs in the VCF then it would be helpful to tell tsinfer which one it should use. You're right that this cuts across the site masking though, so it's not entirely clear what the right thing to do here is. Perhaps just providing a sequence_length option which the user can fill in themselves is the easiest.

jeromekelleher avatar Sep 04 '24 11:09 jeromekelleher

I think the best thing (as long as it doesn't kill efficiency) is to allow a contig to be specified (default None), and if so, start with that as the site mask, then OR the mask with any provided site mask.

We can then check if the (masked) variant_contig array has only a single value, and store that in a VariantData attribute (this would be useful anyway, I think). Then it's easy to pick out a contig_length, if it exists (or use the N+1 hack if it doesn't).

hyanwong avatar Sep 04 '24 12:09 hyanwong