
SAMFileWriterFactory creates .bai file when writing .cram file

Open rickymagner opened this issue 1 year ago • 1 comments

Description of the issue:

When SAMFileWriterFactory writes a .cram file with the "create index" option toggled on, it creates a .bai file for the index rather than a .crai. This means that, for example, running gatk MergeSamFiles --CREATE_INDEX… with a .cram output leaves you with an output.cram.bai file instead of output.cram.crai.

Your environment:

  • version of htsjdk: 3.0.1
  • version of java: 17
  • which OS: MacOS

Steps to reproduce

Run gatk MergeSamFiles as described above.

Expected behaviour

You should get a .crai file.

Actual behaviour

You get a .bai file.
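
To confirm which index actually landed next to the output, a small stdlib-only check can help. This is a minimal sketch, not part of htsjdk or GATK: the class name CramIndexCheck and the helper existingIndexes are hypothetical, and it assumes the index is written beside the CRAM with the suffix appended to the full file name (output.cram.bai or output.cram.crai), as described above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which index files sit next to a CRAM file.
public class CramIndexCheck {
    // ".bai" is what this issue reports; ".crai" is what was expected.
    static final String[] SUFFIXES = {".bai", ".crai"};

    // Returns the names of the index files that exist beside the given CRAM.
    static List<String> existingIndexes(Path cram) {
        List<String> found = new ArrayList<>();
        for (String suffix : SUFFIXES) {
            // e.g. output.cram -> output.cram.bai in the same directory
            Path candidate = cram.resolveSibling(cram.getFileName() + suffix);
            if (Files.exists(candidate)) {
                found.add(candidate.getFileName().toString());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Defaults to output.cram in the working directory (an assumption).
        Path cram = Path.of(args.length > 0 ? args[0] : "output.cram");
        System.out.println(cram + " index files present: " + existingIndexes(cram));
    }
}
```

Run it with the path to the merged CRAM as the only argument. It only checks file names; it does not inspect whether the file contents are in BAI or CRAI format.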


There are a few very old issues surrounding .crai files in the repo. According to this issue, it seems support for them was added but kept off for reasons discussed here. Perhaps it's too much to resurrect the project of getting these indices sorted out, but at the moment it seems GATK silently writes .cram.bai files because of this, which can be pretty confusing. I don't know enough about CRAM vs BAM to know how bad it might be to use one index for the other, but GATK at least seems to work fine doing random access on CRAMs with the .bai file produced as described above. I'm also not sure whether this issue should be pushed up to GATK or kept down here in htsjdk. At the very least, it'd be nice if the library could be updated to use the proper file extension for the index.

rickymagner, Jul 18 '23 22:07

@rickymagner It's actually producing a bai index, not a crai. So it would be equally wrong to rename it to crai. It would be great to fix it to make a crai index but I think it's a bit of a project.

lbergelson, Aug 01 '23 14:08