bgzip/tabix bedpe indexing (2D queries)

Open blajoie opened this issue 6 years ago • 1 comments

Hi,

I was wondering if there has been any discussion around extending bgzip/tabix to support BEDPE files (to ultimately achieve 2D block compression + retrieval). This would be useful for Hi-C interaction matrices, structural variants (SVs), LD matrices, etc.

  1. Is this even possible?
  2. Is this a desirable extension to bgzip/tabix? (the equivalent can be achieved via block-compressed HDF5 files)

test.bedpe - example bedpe file:

    chr1 1 500 chr1 500 1000 itx_1 1000
    chr1 1 500 chr1 1500 2000 itx_2 1000
    chr1 1 500 chr2 1000 1500 itx_3 1000
    chr1 1000 1500 chr1 500 1000 itx_3 1000

example 2D tabix queries:

    $ tabix test.bedpe chr1:1-500 chr2:1000-1500
    chr1 1 500 chr2 1000 1500 itx_3 1000

    $ tabix test.bedpe chr1:1-500 chr1:1-2000
    chr1 1 500 chr1 500 1000 itx_1 1000
    chr1 1 500 chr1 1500 2000 itx_2 1000

Whether the BEDPE file must contain every entry symmetrically, or whether tabix could handle this internally (e.g. the BEDPE only contains entries with interval_1 < interval_2), is something to be decided.
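To make the intended query semantics concrete, here is a minimal Python sketch. It is not tabix code and uses no index: it just scans the example records linearly. The names `Pair`, `parse_bedpe`, `parse_region`, `overlaps`, `query_2d` and the `symmetric` flag are all hypothetical, and the closed-interval overlap test is an assumption made for simplicity (BED-style half-open coordinates would need a slightly different comparison).

```python
from typing import List, NamedTuple, Tuple

# The four records from test.bedpe above, whitespace-separated.
EXAMPLE = """\
chr1 1 500 chr1 500 1000 itx_1 1000
chr1 1 500 chr1 1500 2000 itx_2 1000
chr1 1 500 chr2 1000 1500 itx_3 1000
chr1 1000 1500 chr1 500 1000 itx_3 1000
"""

Region = Tuple[str, int, int]


class Pair(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str
    score: str


def parse_bedpe(text: str) -> List[Pair]:
    """Parse whitespace-separated BEDPE lines (first 8 columns only)."""
    pairs = []
    for line in text.splitlines():
        f = line.split()
        if len(f) < 8:
            continue  # skip blank or short lines
        pairs.append(Pair(f[0], int(f[1]), int(f[2]),
                          f[3], int(f[4]), int(f[5]), f[6], f[7]))
    return pairs


def parse_region(region: str) -> Region:
    """Turn 'chr1:1-500' into ('chr1', 1, 500)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def overlaps(chrom: str, start: int, end: int, region: Region) -> bool:
    # Assumption: both intervals are treated as closed [start, end].
    r_chrom, r_start, r_end = region
    return chrom == r_chrom and start <= r_end and end >= r_start


def query_2d(pairs: List[Pair], region1: str, region2: str,
             symmetric: bool = False) -> List[Pair]:
    """Return pairs whose first interval overlaps region1 and whose second
    interval overlaps region2. With symmetric=True, a pair also matches
    when its two intervals hit the regions in swapped order."""
    r1, r2 = parse_region(region1), parse_region(region2)
    hits = []
    for p in pairs:
        forward = (overlaps(p.chrom1, p.start1, p.end1, r1)
                   and overlaps(p.chrom2, p.start2, p.end2, r2))
        swapped = (symmetric
                   and overlaps(p.chrom1, p.start1, p.end1, r2)
                   and overlaps(p.chrom2, p.start2, p.end2, r1))
        if forward or swapped:
            hits.append(p)
    return hits


if __name__ == "__main__":
    records = parse_bedpe(EXAMPLE)
    for hit in query_2d(records, "chr1:1-500", "chr2:1000-1500"):
        print(*hit)
```

The `symmetric` flag corresponds to the second option above: the file stores each interaction once, and the query layer checks both orientations. A real implementation would replace the linear scan with an index lookup over bgzip blocks.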

I imagine most of the work may be on the bgzip side. Any thoughts? Something on the nearby horizon? :)

blajoie, Jul 12 '17 02:07

Looks like an incarnation of this already exists! https://github.com/4dn-dcic/pairix

blajoie, Jul 14 '17 23:07