tskit icon indicating copy to clipboard operation
tskit copied to clipboard

`write_vcf` returning position 0

Open andrewkern opened this issue 2 years ago • 29 comments
trafficstars

hey team-

working with @mufernando on some stdpopsim coding we stumbled upon a bug with write_vcf. using the default position_transform value, write_vcf can return position 0 which isn't allowed under the vcf spec i believe. e.g., this tree sequence

print(ts.tables.sites)
╔═════╤════════╤═══════════════╤════════╗
║id   │position│ancestral_state│metadata║
╠═════╪════════╪═══════════════╪════════╣
║0    │       0│              C│        ║
║1    │     534│              C│        ║
║2    │     991│              T│        ║
║3    │    1188│              G│        ║
║4    │    1281│              C│        ║
║5    │    1301│              A│        ║

will return the following vcf

##fileformat=VCFv4.2
##source=tskit 0.5.5
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=5000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2	tsk_3	tsk_4	tsk_5	tsk_6	tsk_7	tsk_8	tsk_9	tsk_10	tsk_11	tsk_12	tsk_13	tsk_14	tsk_15	tsk_16	tsk_17	tsk_18	tsk_19
1	0	0	C	G	.	PASS	.	GT	0|0	0|0	0|0	0|0	0|0	0|0	0|0	1|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0
1	534	1	C	G	.	PASS	.	GT	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|1	0|0	0|0	0|0	0|0	0|0
1
.
.
.

i'm guessing we should not have been using the default position_transform here but instead legacy? either way i think this is a bug

andrewkern avatar Sep 08 '23 21:09 andrewkern

legacy will still give you 0s: https://github.com/tskit-dev/tskit/blob/main/python/tskit/vcf.py ... and it looks like a workaround is to pass position_transform = lambda x: 1 + np.round(x) to position_transform? But, this is very annoying - if we made this the default, then the 'position' in the VCF file would be off-by-one from the position in the tree sequence.

Alternatively we could special-case the situation where there's a mutation at position 0, ie, do position_transform = lambda x: np.fmax(1, np.round(x)) (and throw a warning if there is such a mutation)?

petrelharp avatar Sep 08 '23 23:09 petrelharp

i think i would prefer the latter option.

andrewkern avatar Sep 10 '23 05:09 andrewkern

I don't think there's a right answer here tbh. The VCF spec says it wants 1-based coords, but I'm not sure anything bad will happen if there is a zero position in there. Do we know if, say, bcftools will accept a vcf containing a single site with 0 coord?

It would definitely create more problems if we +1'd all the coords by default. For example, in sc2ts we store 1-based coordinates, so these would all be wrong if we started incrementing all coordinates by one in vcf output.

jeromekelleher avatar Sep 11 '23 08:09 jeromekelleher

actually we found the bug because a downstream program threw an error on the VCF

andrewkern avatar Sep 11 '23 14:09 andrewkern

we are creating a VCF in the stdpopsim/analysis2 pipeline here that was messing stuff up

https://github.com/popsim-consortium/analysis2/blob/eb16df28c44a1c3a04a1a5801172e806280f5c2f/workflows/sweep_simulate.snake#L399-L413

For the time being we can create that VCF with @petrelharp's suggested lambda function (i.e. position_transform = lambda x: np.fmax(1, np.round(x))) and then keep that project moving.

andrewkern avatar Sep 11 '23 18:09 andrewkern

I think a lot of downstream code that handles VCFs will assume it's 1-based. Should I at least add a warning to the docs with this potential workaround?

mufernando avatar Oct 03 '23 18:10 mufernando

That's a good suggestion - provide the solution in the docs.

petrelharp avatar Oct 04 '23 04:10 petrelharp

Note that SLiM adds 1 to positions (which are 0-based in SLiM) when saving VCF, and subtracts 1 from positions when reading from VCF (the inverse operation). This means that a round-trip from SLiM to VCF and back preserves the correct information. It also means that, since tskit saves VCF with 0 for the position sometimes, SLiM errors out when reading some tskit-saved VCF files. Came up in this slim-discuss thread: https://groups.google.com/g/slim-discuss/c/cWwRxX2zNb0/m/CotUV5JyAAAJ. I don't know what the right course of action is for tskit at this point. :->

bhaller avatar Feb 06 '24 22:02 bhaller

The more I ponder this, though, the less it seems like a doc bug. It's saving out non-spec-compliant VCF files, and that is causing problems with at least two software packages downstream of it. Seems like maybe a fix is in order, probably doing the same thing that SLiM does (which would also be nice for interoperability, of course).

bhaller avatar Feb 06 '24 22:02 bhaller

I'm going to echo @bhaller's opinion.

I agree it is confusing to have the positions in the VCF be off by one when compared to positions in the tree sequence, but this is something everyone dealing with bioinformatics is aware of. I think it is a lot more confusing to have a VCF that is 0-based, specially if we want tree sequences to be adopted more widely.

I don't fully understand the ramifications of changing the default behavior, but now we have more evidence that the current behavior is messing with other people's downstream codes.

mufernando avatar Feb 06 '24 22:02 mufernando

Correct me if I'm wrong, but I don't think that the issue is the usual 0-or-1-based indexing. If it was, then "gene X extends from position 1000 to position 1200" would mean two different things, that differ at each end by one bp. I think the issue is only that tskit allows a nucleotide at position 0, which is a much smaller deal.

petrelharp avatar Feb 07 '24 00:02 petrelharp

Correct me if I'm wrong, but I don't think that the issue is the usual 0-or-1-based indexing. If it was, then "gene X extends from position 1000 to position 1200" would mean two different things, that differ at each end by one bp. I think the issue is only that tskit allows a nucleotide at position 0, which is a much smaller deal.

I don't understand the distinction you're drawing. Can you elaborate?

bhaller avatar Feb 07 '24 02:02 bhaller

I think the issue is only that tskit allows a nucleotide at position 0, which is a much smaller deal.

I agree with Peter here. Msprime generates sites that can have a 0 position, so you can say it uses 0-based coordinates. Sc2ts uses 1-based coordinates from real data, so it is 1-based. I'm pretty sure the ARGs we infer using tskit used 1-based coordinates. All use tskit.

Adding 1 to all coordinates is incorrect behaviour, as it changes the actual coordinates for everything. So, if we did this, then a VCF output from a sc2ts inferred ARG would have wrong coordinates.

I think this is more correctly seen as an issue with simulators generating 0 or 1 based coordinates, and tskit shouldn't have an opinion about that. I accept that VCF doesn't allow for a coordinate of 0 in the spec, so we should probably emit a warning if there is a site with this coord.

jeromekelleher avatar Feb 07 '24 09:02 jeromekelleher

(As a follow up, we could easily add an option to msprime to disallow sites with position=0, which I'd be happy to discuss over there)

jeromekelleher avatar Feb 07 '24 09:02 jeromekelleher

I'm pretty sure the ARGs we infer using tskit used 1-based coordinates.

Yes, all the tree sequences I'm inferring use 1-based and adding one on VCF output would be very unhelpful.

Here's a strawman suggestion for fixing this, although it is a hard breaking change:

write_vcf errors out if a 0 position is seen with this message:

A variant position of 0 was found in the tree sequence, this is not fully compliant with the VCF spec. If you still wish to write the VCF please use the "allow_zero_position" argument to write_vcf. Alternatively, you can increment all the positions by one using "position_transform = lambda x: 1 + x" or coerce the zero to one with "position_transform = lambda x: np.fmax(1, x)".

If that's too much breakage we could warn for a while before turning it on.

benjeffery avatar Feb 07 '24 10:02 benjeffery

Ah, I see – you're saying tskit is 0/1-based agnostic, and while msprime is 0-based, other uses of tskit may not be, and so tskit should not "correct" values from 0-based to 1-based. OK, makes sense, and @benjeffery's proposal seems good to me. I like how the error message spells out the three possible solutions for the user.

bhaller avatar Feb 07 '24 15:02 bhaller

Wait, so if we have a tree sequence with sequence_length=100 it is allowed to have sites at positions 0 and 100?

mufernando avatar Feb 07 '24 15:02 mufernando

Nope, 0 would be allowed, but 100 not as site positions have to be less than sequence_length. So 99.99999999999 would be fine!

benjeffery avatar Feb 07 '24 18:02 benjeffery

Nope, 0 would be allowed, but 100 not as site positions have to be less than sequence_length. So 99.99999999999 would be fine!

Huh; so if you want to use tskit in a one-based fashion (perhaps planning to produce VCF output), and you wanted 100 positions starting at 1, and thus positions in [1, 100], you'd need to specify sequence_length of 101? Doesn't that imply that tskit is not really 0/1-based-agnostic, but is instead 0-based? Sorry, maybe I'm being picky; just trying to pin things down a bit more. :->

bhaller avatar Feb 07 '24 18:02 bhaller

yes, that is where I was trying to get to @bhaller!

mufernando avatar Feb 07 '24 18:02 mufernando

Yes, that's correct. msprime is zero-based so a sequence length of 100 gives you sites on 0-99. I guess tsinfer should be setting the sequence length to one greater than the ref sequence length, but it has never been an issue as there are never sites at the end.

benjeffery avatar Feb 07 '24 20:02 benjeffery

Yes, @mufernando and @bhaller are correct in that tskit's interpretation of genome coordinates is zero-based (it allows a zero coordinate position). However, we do actively store coordinates in there that are 1-based, and in reality there's nothing to be gained from forcing people to substract 1 from all their coordinates.

Adding 1 to all coords in VCF output would definitely 100% break current and expected future usage, so we're not going to do that.

Ben's suggestion above of erroring out when ts.sites_position[0] == 0 is good I think.

jeromekelleher avatar Feb 08 '24 09:02 jeromekelleher

I've written up a PR for this at https://github.com/tskit-dev/tskit/pull/2901

Would love some feedback from all of you here, as this is a fairly breaking change.

benjeffery avatar Feb 08 '24 13:02 benjeffery

I wonder, would a better option be to warn instead of error if a zero position is printed, as suggested above? That avoids the breaking change, and instead a flag could be provided to optionally exclude or error on sites with position zero if encountered.

apragsdale avatar Feb 08 '24 13:02 apragsdale

My opinion is that warnings often get missed/ignored, and writing non-compliant VCF is probably a bad thing, so making it an error seems fine to me. It is a breaking change, yes, but fixing it will be trivial for users, so that doesn't worry me too much – much worse are breaking changes that require substantial effort/thought by the user to cope with. I don't feel super strongly about it though, just my two cents. :->

bhaller avatar Feb 08 '24 14:02 bhaller

I guess I have a bit of a different view: that it's the user's responsibility to pay attention to warnings and understand input and output of functions, and generally a bad idea to introduce breaking changes (even if minor) if they can be avoided.

But perhaps outputting a VCF with a zero position could be considered a bad enough bug that defaulting to an error is warranted. The function doesn't allow for decimal positions to be written, for example, so we're already enforcing some VCF standards (might as well enforce them all, incl. no zero position?).

apragsdale avatar Feb 08 '24 17:02 apragsdale

The issue with it being the users' responsibility is that the user producing the VCF and the user consuming it are often different. By doing this we force the user to acknowledge that the VCF might not play nice downstream.

I agree that breaking changes suck, but at least this one is an easy fix.

benjeffery avatar Feb 08 '24 19:02 benjeffery

I think it's ok to make this breaking change as it's best seen as a bug fix. I do hate to break people's code, but this seems like the lesser of two evils.

jeromekelleher avatar Feb 08 '24 20:02 jeromekelleher

As I said over at that PR, I like this solution: there's an informative error message that provides the solution.

petrelharp avatar Feb 09 '24 03:02 petrelharp

I understand I'm late here, but I have a comment about the proposed solution.

First, VCF should never have a position 0 in it, that's out of the question, so the error is welcome.

But as a user, I'm not very comfortable with any of the three options suggested. Instead, I'd more likely prefer that site just being dropped / filtered away, as if it was never simulated. With the suggested solutions, I could pick option 1 (keep 0 there) or 3 (change 0 to 1, hope you don't have a variant also at position 1), and filter this position afterwards, making it a two step process (which might be error-prone).

Perhaps it's worth adding this option to the error message, or even as a parameter to the function to automate the filtering for the user.

That said, I almost never use the tree sequence but generally convert it to other format (typically VCF), so interoperability of the two formats is not that important to me. But it may be important for others who do care about tree sequences and interoperability, and having the same number of SNPs across formats. Hence, I think it should be included as one of the options and users can pick. Of course, I may also care in the future, tree sequence does look useful for some future projects.

janxkoci avatar Mar 17 '25 14:03 janxkoci