htsjdk icon indicating copy to clipboard operation
htsjdk copied to clipboard

Support for long reference sequences

Open jayoung opened this issue 8 years ago • 17 comments

Hi there,

I'd like to add another voice for providing support for longer reference sequences: it already seems useful for some genomes, and as better assembled and larger genomes come out it seems like the need will get more frequent.

I've been working a little with the opossum monDom5 assembly where two chromosomes too long to be handled by IGV (chr1 is 748 Mb and chr2 is 541 Mb). I've been told it's an underlying limitation of htsjdk that creates this issue for IGV with longer reference sequences, so I'm posting the request here too.

Thanks for considering it,

Janet Young


Dr. Janet Young

Malik lab http://research.fhcrc.org/malik/en.html

Division of Basic Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512 email: jayoung ...at... fredhutch.org


jayoung avatar Jun 15 '17 17:06 jayoung

as far as I know the max length of a chromosome is max(int32)= 2,147,483,647 so there shouldn't have any problem on this side.

There may be a problem when you're using a binning index (like bam.bai or tabix.tbi) : where the max size is only 512Mb (http://genomewiki.ucsc.edu/index.php/Bin_indexing_system ). But you can always use a csi index: https://www.biostars.org/p/111984/

lindenb avatar Jun 15 '17 18:06 lindenb

Yes, the problem is the index. Does htsjdk support csi indexes? My understanding is it does not. I'm looking at the source code but don't see anything.

jrobinso avatar Jun 16 '17 03:06 jrobinso

It doesn't support CSI as far as I know.

lbergelson avatar Jun 16 '17 14:06 lbergelson

How much of a burden would it be to have support for csi indices?

On Fri, Jun 16, 2017 at 10:02 AM, Louis Bergelson [email protected] wrote:

It doesn't support CSI as far as I know.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/issues/900#issuecomment-309034123, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0pfnDdfi3komGZOeL1zaw8ATPB2lks5sEor2gaJpZM4N7i9G .

yfarjoun avatar Jun 16 '17 14:06 yfarjoun

I'm not really sure, but we haven't always done a great job of handling index related things in a clean way, and I suspect we have client code that relies on the details of the bai index since it's been the only one we support for bam. I'd guess that it would be a fair amount of work to implement.

lbergelson avatar Jun 16 '17 17:06 lbergelson

from what I saw online, bai is a specialization of csi....so it might not be so bad...it would be good for someone knowledgable about the format to step in though....

yfarjoun avatar Jun 16 '17 17:06 yfarjoun

Thanks all for looking into this: hoping someone capable (I am not capable) is interested in taking this on at some point.

Janet

jayoung avatar Jul 08 '17 01:07 jayoung

There is another issue open regarding CSI index support in htsjdk: #447

nathanhaigh avatar Dec 14 '17 00:12 nathanhaigh

Hi,

Picard has a PR that allows to work with CSI index (from https://github.com/samtools/htsjdk/pull/1040) https://github.com/broadinstitute/picard/pull/998

But I need htsjdk to be able to write CSI as well (example of Wheat genome):

Exception in thread "main" htsjdk.samtools.SAMException: Exception when processing alignment for BAM index M01322:139:000000000-BVL2R:1:2108:5384:21249 2/2 301b aligned to chr1A:537177243-537177543. Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record M01322:139:000000000-BVL2R:1:2108:5384:21249 2/2 301b aligned to chr1A:537177243-537177543. Caused by: java.lang.IllegalStateException: Read position too high for BAI bin indexing.

How can I help?

FredericBGA avatar Apr 02 '20 09:04 FredericBGA

I think GATK SHOULD work with bams with a CSI index already as well. There's no VCF support at the moment which will probably be problematic though.

I don't think we have any plan to implement it on our own in the near future. We just don't have the bandwidth to look into it, especially with the current pandemic craziness which is significantly cutting into people's coding time.

However, if you're interested in contributing and have a fairly significant chunk of time on your hands we'd be able to work with you on getting a CSI writer into htsjdk. In theory it shouldn't be too hard, it's basically just a tiny modification to the bai writer. I suspect there will be a lot of pain points though with how the code is structured and (poorly) tested.
The index code is unfortunately, really gross. It wasn't really designed to be extensible. There's also a weird divide between the bam index code and the vcf index code, the CSI could in theory be used for VCF to support long references, but currently we only support tabix. As far as I understand vcf CSI and bam CSI have to be slightly different due to how extra metadata is encoded in the bai/ bam csi.

@cmnbroad IS possibly going to be trying to modernize some of the index code it in the near future. He might be able to offer some guidance as to where to start if you're interested.

We'd love the help, but it will probably take a while.

lbergelson avatar Apr 03 '20 17:04 lbergelson

Hey @lbergelson @cmnbroad hang in there, must be crazy where you are.

RE index modernization, ping me if you do that, I'm interested in modernizing the igv.js code as well, might be some synergy. Currently both bam and tabix indexes are handled by the same code, with a few switches, this in turn was adapted from even older code. At least its relatively small https://github.com/igvteam/igv.js/blob/master/js/bam/bamIndex.js

jrobinso avatar Apr 03 '20 19:04 jrobinso

Perhaps the code developed for use by JBrowse could help?

BAM Index

  • https://github.com/GMOD/bam-js/blob/master/src/bai.ts
  • https://github.com/GMOD/bam-js/blob/master/src/csi.ts

Tabix Index

  • https://github.com/GMOD/tabix-js/blob/master/src/tbi.ts
  • https://github.com/GMOD/tabix-js/blob/master/src/csi.ts

nathanhaigh avatar Apr 03 '20 20:04 nathanhaigh

Thank you for all these comments. I'm not sure that my skills (java and computing) will be enough, but I'm still ready to help. Of course I'm stuck at home right now like half of the people on Earth, so this is not the best timing but we could get back in touch when you managed to find time to make progress on this issue.

FredericBGA avatar Apr 08 '20 06:04 FredericBGA

Hi Guys, First of all, thank you for the read support for .csi indices.

We start seeing more and more genomes where the contigs exceed the limits of the .bai index. While it's good that htsjdk can read these indices, we need write support, as well.

Not being able to write a .csi index is a significant limitation when working with large, well assembled, plant genomes. And we believe that this will become more important as well assembled genomes become more frequent, due to the rise of long read technologies.

Adding this functionality would be very much appreciated.

aschaetz avatar Jun 01 '22 10:06 aschaetz

there was a project samtools/htsjdk-next-beta which aim was to implement long REFs (eg.: https://github.com/samtools/htsjdk-next-beta/pull/6 ) . But the project looks inactive (dead ?) now.

lindenb avatar Jun 01 '22 10:06 lindenb

Yes, sadly no activity there.

aschaetz avatar Jun 09 '22 17:06 aschaetz

My group (Buckler Lab at Cornell University) also has a need for VCF CSI support for software we write/maintain. Our open source code (Buckler Lab PHG) uses gvcf files to store variant information for plant genomes and has run into problems working with the larger genomes .e.g wheat.

Has anything changed in terms of scheduling CSI support for VCF? We would be interested in working on this in the htsjdk code base if there were someone with whom we could consult. @lbergelson @cmnbroad (or anyone else) will you comment/ update on where this stands? Thanks

lynnjo avatar Nov 08 '22 13:11 lynnjo