htslib
bgzip/tabix bedpe indexing (2D queries)
Hi,
I was wondering if there has been any discussion around extending bgzip/tabix to support BEDPE files (to ultimately achieve 2D block compression and retrieval). This would be useful for Hi-C interaction matrices, structural variants (SVs), LD matrices, etc.
- Is this even possible?
- Is this a desirable extension to bgzip/tabix? (the equivalent can be achieved via block-compressed HDF5 files)
test.bedpe - example bedpe file

    chr1 1 500 chr1 500 1000 itx_1 1000
    chr1 1 500 chr1 1500 2000 itx_2 1000
    chr1 1 500 chr2 1000 1500 itx_3 1000
    chr1 1000 1500 chr1 500 1000 itx_3 1000
example 2D tabix queries:
    $ tabix test.bedpe chr1:1-500 chr2:1000-1500
    chr1 1 500 chr2 1000 1500 itx_3 1000

    $ tabix test.bedpe chr1:1-500 chr1:1-2000
    chr1 1 500 chr1 500 1000 itx_1 1000
    chr1 1 500 chr1 1500 2000 itx_2 1000
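To pin down the semantics I have in mind for these queries, here is a rough Python sketch that does a plain linear scan over the example records. It is not how tabix works and uses no index; `parse_region`, `overlaps`, and `query_2d` are hypothetical names, and coordinates are treated as closed intervals purely for simplicity.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), closed interval (assumption)

BEDPE = [
    "chr1 1 500 chr1 500 1000 itx_1 1000",
    "chr1 1 500 chr1 1500 2000 itx_2 1000",
    "chr1 1 500 chr2 1000 1500 itx_3 1000",
    "chr1 1000 1500 chr1 500 1000 itx_3 1000",
]

def parse_region(text: str) -> Region:
    """Parse a tabix-style region string such as 'chr1:1-500'."""
    chrom, span = text.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)

def overlaps(chrom: str, start: int, end: int, region: Region) -> bool:
    """True if the interval shares at least one position with the region."""
    r_chrom, r_start, r_end = region
    return chrom == r_chrom and start <= r_end and end >= r_start

def query_2d(lines: List[str], region1: str, region2: str) -> List[str]:
    """Return records whose first interval overlaps region1 AND whose
    second interval overlaps region2 (no symmetric matching)."""
    r1, r2 = parse_region(region1), parse_region(region2)
    hits = []
    for line in lines:
        f = line.split()
        if overlaps(f[0], int(f[1]), int(f[2]), r1) and \
           overlaps(f[3], int(f[4]), int(f[5]), r2):
            hits.append(line)
    return hits

for hit in query_2d(BEDPE, "chr1:1-500", "chr2:1000-1500"):
    print(hit)
```

A real implementation would presumably replace the scan with a lookup into a two-dimensional index over the bgzip blocks.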
Still to be decided: whether the BEDPE file must list every interaction symmetrically (in both orientations), or whether tabix could handle this internally (e.g. the file only contains records with interval_1 < interval_2).
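One way the "handle it internally" option could look, as a sketch only: store each record once in a canonical orientation and test both orientations at query time. The names `canonicalize` and `query_symmetric` are hypothetical, chromosomes are compared as plain strings here (a real index would more likely use its own sequence order), and intervals are again treated as closed.

```python
def overlaps(interval, region):
    """Closed-interval overlap test on (chrom, start, end) tuples."""
    return (interval[0] == region[0]
            and interval[1] <= region[2]
            and interval[2] >= region[1])

def canonicalize(fields):
    """Order the two intervals so the smaller (chrom, start, end) comes first.
    Assumption: string comparison of chromosome names is good enough here."""
    a = (fields[0], int(fields[1]), int(fields[2]))
    b = (fields[3], int(fields[4]), int(fields[5]))
    first, second = (a, b) if a <= b else (b, a)
    return (*first, *second, *fields[6:])

def query_symmetric(store, region1, region2):
    """Match a canonical record against the query in either orientation."""
    hits = []
    for rec in store:
        a, b = rec[0:3], rec[3:6]
        if (overlaps(a, region1) and overlaps(b, region2)) or \
           (overlaps(a, region2) and overlaps(b, region1)):
            hits.append(rec)
    return hits

BEDPE = [
    "chr1 1 500 chr1 500 1000 itx_1 1000",
    "chr1 1 500 chr1 1500 2000 itx_2 1000",
    "chr1 1 500 chr2 1000 1500 itx_3 1000",
    "chr1 1000 1500 chr1 500 1000 itx_3 1000",
]
STORE = [canonicalize(line.split()) for line in BEDPE]

# The last record was written "backwards" in the file; it is still found
# regardless of the order in which the two regions are given.
print(query_symmetric(STORE, ("chr1", 1001, 1500), ("chr1", 501, 1000)))
```

The trade-off is that the file stays half the size, at the cost of two index lookups per query instead of one.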
I imagine most of the work may be on the bgzip side. Any thoughts? Something on the nearby horizon? :)
Looks like an incarnation of this already exists! https://github.com/4dn-dcic/pairix