tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Error when the VCF will have a zero position site

Open benjeffery opened this issue 1 year ago • 9 comments

Fixes #2838

Note that this is a fairly breaking change that we should think about, given that the default is for msprime output to require the new flag to write_vcf.

benjeffery avatar Feb 08 '24 13:02 benjeffery

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (1601a5e) 89.72% compared to head (593d5a2) 92.97%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2901      +/-   ##
==========================================
+ Coverage   89.72%   92.97%   +3.24%     
==========================================
  Files          30       23       -7     
  Lines       30296    15937   -14359     
  Branches     5896     3140    -2756     
==========================================
- Hits        27183    14817   -12366     
+ Misses       1781      667    -1114     
+ Partials     1332      453     -879     
Flag Coverage Δ
c-tests ?
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.71% <0.00%> (-0.03%) :arrow_down:
python-tests 98.96% <100.00%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
python/tskit/cli.py 96.96% <100.00%> (+0.02%) :arrow_up:
python/tskit/trees.py 98.74% <100.00%> (+<0.01%) :arrow_up:
python/tskit/vcf.py 98.44% <100.00%> (+0.02%) :arrow_up:

... and 7 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 1601a5e...593d5a2. Read the comment docs.

codecov[bot] avatar Feb 08 '24 14:02 codecov[bot]

One thought experiment is: what if some app using tskit is already generating all site positions on 1 <= x < seq length + 1?

molpopgen avatar Feb 08 '24 14:02 molpopgen

One thought experiment is: what if some app using tskit is already generating all site positions on 1 <= x < seq length + 1?

That wouldn't be a valid tree sequence as all positions have to be less than sequence length.

benjeffery avatar Feb 08 '24 19:02 benjeffery

That wouldn't be a valid tree sequence as all positions have to be less than sequence length.

Oops -- it would be if I'd written it correctly (w/o the +1). But my point should have been: we have no firm requirement that the minimum position actually used is zero.

molpopgen avatar Feb 08 '24 19:02 molpopgen

I'm not sure I get what you mean - you can have a tree sequence with no sites? Maybe you mean we have no firm specification for how the reference sequence in the tree sequence maps onto the position field. We don't even have a requirement that the ref seq length is equal to sequence_length-1.

benjeffery avatar Feb 08 '24 19:02 benjeffery

I'm not sure I get what you mean - you can have a tree sequence with no sites? Maybe you mean we have no firm specification for how the reference sequence in the tree sequence maps onto the position field. We don't even have a requirement that the ref seq length is equal to sequence_length-1.

Imagine that someone only considers positions from [10, seqlen). Their "genome" starts at position 10, not 0. That is a valid tree sequence and they can choose their seqlen so that the max allowed site position matches whatever they have in mind, say 100. So they are modeling a gene segment from positions [10, 100] for some reason using a table collection with seqlen of 101.

This is a valid use of the API. What would this PR do to this use case?

molpopgen avatar Feb 08 '24 19:02 molpopgen

I think my questions/confusion are related to some comments in the linked issue: seqlen must be > the max allowed position but the data model is not necessarily zero indexed.

I'll go away now...

molpopgen avatar Feb 08 '24 20:02 molpopgen

This is just checking to see if the first position is zero, nothing else. The seqlen stuff was a digression on the thread

jeromekelleher avatar Feb 08 '24 20:02 jeromekelleher

I love it (as much as is possible for a weird VCF hack). Nice solution.

petrelharp avatar Feb 09 '24 02:02 petrelharp