hts-specs icon indicating copy to clipboard operation
hts-specs copied to clipboard

Expand discussion of how to use a BAI/CSI index to compute query intersections

Open jmarshall opened this issue 3 years ago • 3 comments

As noted on PR #521, the specification does not fully describe how to use an index to check whether a particular read intersects with a query interval. In particular,

[The change around zero-consumption CIGARs] is in fact not really important for reg2bin and bin calculations at all — where the special case for unmapped or these reads is actually used is in another phase of interval queries, in checking that a putative intersecting read returned by walking the bin structure really does intersect the query interval. It's discussed in this spec section because the computation should be the same for querying and bin calculation, but the real problem is that the specification does not properly describe to this level of detail how to use bins to answer queries.

Originally posted in https://github.com/samtools/hts-specs/pull/521#issuecomment-677744051

jmarshall avatar Jan 06 '21 23:01 jmarshall

I'm not sure the quote is necessarily correct. It's the case that the zero length CIGARs weren't being used to update the bin linear index so the start..end ranges could be invalid for a zero-length read. However personally I view that as a bug and as far as bins go I believe it was working. Completely agree that the bin field should match the bin it was placed into for purposes of constructing the bin index during indexing.

I'm not sure what's missing though from the spec. It looks to give a reasonable description of how to use bins to answer queries. 5.1.1 answers that succintly as far as bins go and people could stop at that point and still get a working query (I think).

Where it gets fuzzy is linear indexes in 5.1.3. I'd like a better picture than the one used, showing bins, sequences, ranges covered and the linear index. Something similar to figure 1 in https://academic.oup.com/bioinformatics/article/26/14/1699/178142, but with linear index and multiple chunks instead of 1 as the 8-way tree gives holes between successive start..end ranges.

jkbonfield avatar Jan 07 '21 14:01 jkbonfield

I have some suboptimal Tikz to produce this type of plot:

image

I'm thinking maybe it needs to be less cluttered, or perhaps one like this and then some zoomed up spy-glass style elements in which we can discuss the specifics. That said, it also needs some markers for genomic coordinates. Things that can be demonstrated visually would be:

  • Location of reads into bins, showing long reads end up in bigger bins or short reads if overlapping boundaries.

  • Linear index, obviously.

  • Highlighted bins that come back from an X-to-Y query (shading).

  • Chunks, how many and where they start / end. How we'd use them from an indexing perspective to ignore bins that overlap but don't have data covering the specific query.

  • Omission of empty bins from the index.

jkbonfield avatar Feb 26 '21 17:02 jkbonfield

I finally found the source for this pic. I hunted high and low and then remembered it was on my home PC as viewing the PDF was much quicker there. It's pretty ghastly and not the ideal way to write it in tikz I'm sure, but here it is (ignore the header - it's just a cut down copy of the SAM spec to get boilerplate stuff in):

https://github.com/jkbonfield/hts-specs/blob/tikz_pic/pic.tex#L48-L141

jkbonfield avatar May 27 '21 14:05 jkbonfield