sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

Genomic coordinate systems

Open tomwhite opened this issue 5 years ago • 5 comments

Are genomic positions in sgkit 0-based or 1-based? We should decide and document the coordinate system conventions that sgkit uses.

[Please correct any mistakes I've made below!]

When reading VCF, PLINK, and BGEN files we leave the position variables unchanged, so they use the convention of the underlying file format. These are as follows:

  • VCF is 1-based, fully closed http://samtools.github.io/hts-specs/VCFv4.3.pdf
  • PLINK (.bim) is 1-based https://www.cog-genomics.org/plink/2.0/formats#bim
  • BGEN (I can't find documentation saying which convention this uses)

In simulate_genotype_call_dataset we create positions starting at 0, which implies they are 0-based. So this is inconsistent with reading the file formats above.

For comparison here are the conventions that other libraries and tools use.

Whichever convention we adopt, we will likely need to support multiple ways of specifying intervals for selection, or windowing, which may use either convention. (This article has a good summary of options.)

For example, chr20:1-100 (for selecting, see #25) is 1-based, fully closed (i.e. the end position is inclusive). On the other hand BED (which would be a useful option for specifying windows) uses 0-based, half-open intervals (i.e. the end position is exclusive).

tomwhite avatar Jan 05 '21 13:01 tomwhite

I had hoped that we could be agnostic and just reflect what's in the underlying data files - it's a horrible can of worms. There will always be some case where we do the wrong thing if we try to "correct" the input data, as it's a complete mess out there.

Can we get away with this, just saying "it's up to you to understand what coordinates your data uses"?

The simulation data we produce doesn't matter as it's only toy data for testing anyway.

jeromekelleher avatar Jan 05 '21 17:01 jeromekelleher

Intervals are a different issue I think - we should be opinionated in this case and do half-open like BED.

jeromekelleher avatar Jan 05 '21 17:01 jeromekelleher

I've thought a bit more about this, and I think we probably can't avoid taking a position on this. Fundamentally, a user who is using (standards-compliant) VCF and BED files together would expect us to interpret them correctly. So, we probably do need to choose either one-based or zero-based positions, and add some APIs for shifting input files that may differ by one (as @eric-czech suggested).

The question is, which one do we choose? We will be wrong in a large number of people's eyes whichever we choose (unless we split the difference and add/substract 0.5 from all coordinates :wink: ).

jeromekelleher avatar Jan 08 '21 16:01 jeromekelleher

Just wanted to note here that the GA4GH Variation Representation Specification uses inter-residue coordinates.

tomwhite avatar Nov 11 '21 12:11 tomwhite

Tutorial:Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems

Python CLI: https://github.com/griffithlab/convert_zero_one_based

tomwhite avatar Nov 30 '22 11:11 tomwhite