tskit
tskit copied to clipboard
`write_vcf` returning position 0
hey team-
working with @mufernando on some stdpopsim coding we stumbled upon a bug with write_vcf. using the default position_transform value, write_vcf can return position 0 which isn't allowed under the vcf spec i believe. e.g., this tree sequence
print(ts.tables.sites)
╔═════╤════════╤═══════════════╤════════╗
║id │position│ancestral_state│metadata║
╠═════╪════════╪═══════════════╪════════╣
║0 │ 0│ C│ ║
║1 │ 534│ C│ ║
║2 │ 991│ T│ ║
║3 │ 1188│ G│ ║
║4 │ 1281│ C│ ║
║5 │ 1301│ A│ ║
will return the following vcf
##fileformat=VCFv4.2
##source=tskit 0.5.5
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=5000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1 tsk_2 tsk_3 tsk_4 tsk_5 tsk_6 tsk_7 tsk_8 tsk_9 tsk_10 tsk_11 tsk_12 tsk_13 tsk_14 tsk_15 tsk_16 tsk_17 tsk_18 tsk_19
1 0 0 C G . PASS . GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
1 534 1 C G . PASS . GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0 0|0 0|0
1
.
.
.
i'm guessing we should not have been using the default position_transform here but instead legacy? either way i think this is a bug
legacy will still give you 0s: https://github.com/tskit-dev/tskit/blob/main/python/tskit/vcf.py
... and it looks like a workaround is to pass position_transform = lambda x: 1 + np.round(x) to position_transform? But, this is very annoying - if we made this the default, then the 'position' in the VCF file would be off-by-one from the position in the tree sequence.
Alternatively we could special-case the situation where there's a mutation at position 0, ie, do position_transform = lambda x: np.fmax(1, np.round(x)) (and throw a warning if there is such a mutation)?
i think i would prefer the latter option.
I don't think there's a right answer here tbh. The VCF spec says it wants 1-based coords, but I'm not sure anything bad will happen if there is a zero position in there. Do we know if, say, bcftools will accept a vcf containing a single site with 0 coord?
It would definitely create more problems if we +1'd all the coords by default. For example, in sc2ts we store 1-based coordinates, so these would all be wrong if we started incrementing all coordinates by one in vcf output.
actually we found the bug because a downstream program threw an error on the VCF
we are creating a VCF in the stdpopsim/analysis2 pipeline here that was messing stuff up
https://github.com/popsim-consortium/analysis2/blob/eb16df28c44a1c3a04a1a5801172e806280f5c2f/workflows/sweep_simulate.snake#L399-L413
For the time being we can create that VCF with @petrelharp's suggested lambda function (i.e. position_transform = lambda x: np.fmax(1, np.round(x))) and then keep that project moving.
I think a lot of downstream code that handles VCFs will assume it's 1-based. Should I at least add a warning to the docs with this potential workaround?
That's a good suggestion - provide the solution in the docs.
Note that SLiM adds 1 to positions (which are 0-based in SLiM) when saving VCF, and subtracts 1 from positions when reading from VCF (the inverse operation). This means that a round-trip from SLiM to VCF and back preserves the correct information. It also means that, since tskit saves VCF with 0 for the position sometimes, SLiM errors out when reading some tskit-saved VCF files. Came up in this slim-discuss thread: https://groups.google.com/g/slim-discuss/c/cWwRxX2zNb0/m/CotUV5JyAAAJ. I don't know what the right course of action is for tskit at this point. :->
The more I ponder this, though, the less it seems like a doc bug. It's saving out non-spec-compliant VCF files, and that is causing problems with at least two software packages downstream of it. Seems like maybe a fix is in order, probably doing the same thing that SLiM does (which would also be nice for interoperability, of course).
I'm going to echo @bhaller's opinion.
I agree it is confusing to have the positions in the VCF be off by one when compared to positions in the tree sequence, but this is something everyone dealing with bioinformatics is aware of. I think it is a lot more confusing to have a VCF that is 0-based, specially if we want tree sequences to be adopted more widely.
I don't fully understand the ramifications of changing the default behavior, but now we have more evidence that the current behavior is messing with other people's downstream codes.
Correct me if I'm wrong, but I don't think that the issue is the usual 0-or-1-based indexing. If it was, then "gene X extends from position 1000 to position 1200" would mean two different things, that differ at each end by one bp. I think the issue is only that tskit allows a nucleotide at position 0, which is a much smaller deal.
Correct me if I'm wrong, but I don't think that the issue is the usual 0-or-1-based indexing. If it was, then "gene X extends from position 1000 to position 1200" would mean two different things, that differ at each end by one bp. I think the issue is only that tskit allows a nucleotide at position 0, which is a much smaller deal.
I don't understand the distinction you're drawing. Can you elaborate?
I think the issue is only that tskit allows a nucleotide at position 0, which is a much smaller deal.
I agree with Peter here. Msprime generates sites that can have a 0 position, so you can say it uses 0-based coordinates. Sc2ts uses 1-based coordinates from real data, so it is 1-based. I'm pretty sure the ARGs we infer using tskit used 1-based coordinates. All use tskit.
Adding 1 to all coordinates is incorrect behaviour, as it changes the actual coordinates for everything. So, if we did this, then a VCF output from a sc2ts inferred ARG would have wrong coordinates.
I think this is more correctly seen as an issue with simulators generating 0 or 1 based coordinates, and tskit shouldn't have an opinion about that. I accept that VCF doesn't allow for a coordinate of 0 in the spec, so we should probably emit a warning if there is a site with this coord.
(As a follow up, we could easily add an option to msprime to disallow sites with position=0, which I'd be happy to discuss over there)
I'm pretty sure the ARGs we infer using tskit used 1-based coordinates.
Yes, all the tree sequences I'm inferring use 1-based and adding one on VCF output would be very unhelpful.
Here's a strawman suggestion for fixing this, although it is a hard breaking change:
write_vcf errors out if a 0 position is seen with this message:
A variant position of 0 was found in the tree sequence, this is not fully compliant with the VCF spec. If you still wish to write the VCF please use the "allow_zero_position" argument to write_vcf. Alternatively, you can increment all the positions by one using "position_transform = lambda x: 1 + x" or coerce the zero to one with "position_transform = lambda x: np.fmax(1, x)".
If that's too much breakage we could warn for a while before turning it on.
Ah, I see – you're saying tskit is 0/1-based agnostic, and while msprime is 0-based, other uses of tskit may not be, and so tskit should not "correct" values from 0-based to 1-based. OK, makes sense, and @benjeffery's proposal seems good to me. I like how the error message spells out the three possible solutions for the user.
Wait, so if we have a tree sequence with sequence_length=100 it is allowed to have sites at positions 0 and 100?
Nope, 0 would be allowed, but 100 not as site positions have to be less than sequence_length. So 99.99999999999 would be fine!
Nope,
0would be allowed, but100not as site positions have to be less thansequence_length. So99.99999999999would be fine!
Huh; so if you want to use tskit in a one-based fashion (perhaps planning to produce VCF output), and you wanted 100 positions starting at 1, and thus positions in [1, 100], you'd need to specify sequence_length of 101? Doesn't that imply that tskit is not really 0/1-based-agnostic, but is instead 0-based? Sorry, maybe I'm being picky; just trying to pin things down a bit more. :->
yes, that is where I was trying to get to @bhaller!
Yes, that's correct. msprime is zero-based so a sequence length of 100 gives you sites on 0-99. I guess tsinfer should be setting the sequence length to one greater than the ref sequence length, but it has never been an issue as there are never sites at the end.
Yes, @mufernando and @bhaller are correct in that tskit's interpretation of genome coordinates is zero-based (it allows a zero coordinate position). However, we do actively store coordinates in there that are 1-based, and in reality there's nothing to be gained from forcing people to substract 1 from all their coordinates.
Adding 1 to all coords in VCF output would definitely 100% break current and expected future usage, so we're not going to do that.
Ben's suggestion above of erroring out when ts.sites_position[0] == 0 is good I think.
I've written up a PR for this at https://github.com/tskit-dev/tskit/pull/2901
Would love some feedback from all of you here, as this is a fairly breaking change.
I wonder, would a better option be to warn instead of error if a zero position is printed, as suggested above? That avoids the breaking change, and instead a flag could be provided to optionally exclude or error on sites with position zero if encountered.
My opinion is that warnings often get missed/ignored, and writing non-compliant VCF is probably a bad thing, so making it an error seems fine to me. It is a breaking change, yes, but fixing it will be trivial for users, so that doesn't worry me too much – much worse are breaking changes that require substantial effort/thought by the user to cope with. I don't feel super strongly about it though, just my two cents. :->
I guess I have a bit of a different view: that it's the user's responsibility to pay attention to warnings and understand input and output of functions, and generally a bad idea to introduce breaking changes (even if minor) if they can be avoided.
But perhaps outputting a VCF with a zero position could be considered a bad enough bug that defaulting to an error is warranted. The function doesn't allow for decimal positions to be written, for example, so we're already enforcing some VCF standards (might as well enforce them all, incl. no zero position?).
The issue with it being the users' responsibility is that the user producing the VCF and the user consuming it are often different. By doing this we force the user to acknowledge that the VCF might not play nice downstream.
I agree that breaking changes suck, but at least this one is an easy fix.
I think it's ok to make this breaking change as it's best seen as a bug fix. I do hate to break people's code, but this seems like the lesser of two evils.
As I said over at that PR, I like this solution: there's an informative error message that provides the solution.
I understand I'm late here, but I have a comment about the proposed solution.
First, VCF should never have a position 0 in it, that's out of the question, so the error is welcome.
But as a user, I'm not very comfortable with any of the three options suggested. Instead, I'd more likely prefer that site just being dropped / filtered away, as if it was never simulated. With the suggested solutions, I could pick option 1 (keep 0 there) or 3 (change 0 to 1, hope you don't have a variant also at position 1), and filter this position afterwards, making it a two step process (which might be error-prone).
Perhaps it's worth adding this option to the error message, or even as a parameter to the function to automate the filtering for the user.
That said, I almost never use the tree sequence but generally convert it to other format (typically VCF), so interoperability of the two formats is not that important to me. But it may be important for others who do care about tree sequences and interoperability, and having the same number of SNPs across formats. Hence, I think it should be included as one of the options and users can pick. Of course, I may also care in the future, tree sequence does look useful for some future projects.