htsjdk.samtools.SAMException sequence name doesn't match regex

Open kviljoen opened this issue 4 years ago • 7 comments

Description of the issue:

I have a SAM file with restricted characters (in my case commas) in the sequence names that I'm trying to load into IGV. I can convert the SAM to BAM and index it, but I get a regex error when loading the BAM file into IGV:

Error loading BAM file: htsjdk.samtools.SAMException: Sequence name 'gi|545903863|ref|NZ_BATA01000117.1|:1938-2201,2205-2258' doesn't match regex: '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*'

I realize that the characters ‘\ , "‘’ () [] {} <>’ are restricted, so I'm not sure whether this is an htsjdk issue or a problem with how the sequences were named in the first place. Replacing those characters allows the file to load, but it would be great to have a more sustainable solution. Screenshot attached.

This issue has also been described here https://groups.google.com/forum/#!msg/igv-help/8wRmwA-4skE/6Zzq4ZUPBQAJ
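
For anyone wanting to check names before loading a file, here is a minimal Python sketch that applies the pattern quoted in the error message to the sequence name from this report. The helper names `is_valid_sequence_name` and `offending_characters` are made up for illustration and are not part of htsjdk; the sketch only mirrors the regex, not htsjdk's actual validation code.

```python
import re

# Pattern copied verbatim from the error message above.
SEQUENCE_NAME = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)

# Characters the pattern accepts in any position after the first one.
ALLOWED_CHAR = re.compile(r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]")


def is_valid_sequence_name(name):
    """Return True if the whole name matches the pattern (hypothetical helper)."""
    return SEQUENCE_NAME.fullmatch(name) is not None


def offending_characters(name):
    """Return the distinct characters the pattern never accepts (hypothetical helper).

    This does not flag a leading '*' or '=', which the pattern rejects
    only in the first position.
    """
    return sorted({ch for ch in name if not ALLOWED_CHAR.fullmatch(ch)})


name = "gi|545903863|ref|NZ_BATA01000117.1|:1938-2201,2205-2258"
print(is_valid_sequence_name(name))
print(offending_characters(name))
```

In this name the comma is the only character outside the allowed set, which is consistent with the observation that replacing it lets the file load.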

Environment:

  • version of htsjdk: Default installed with IGV v2.5.2
  • version of java: 1.8.0_171
  • which OS: Mac OSX 10.12.6

Steps to reproduce

Loading a .bam file in IGV with File -> Load from File

Expected behaviour

Successful file load

Actual behaviour

Error as in screenshot.

kviljoen avatar Apr 08 '20 08:04 kviljoen

@kviljoen This is an understandable pain. Those characters were disallowed in SAM sequence names in a relatively recent update to the SAM spec and htsjdk. We found it necessary to disallow a number of characters because they are incompatible with downstream formats (they break VCF parsing, for instance). Unfortunately, no one explicitly stated a policy for naming chromosomes in early versions of SAM; I think people just assumed that no one would use any weird characters (an obviously faulty assumption...).

We decided to add this check to stop new instances of bad names from occurring, but it has the side effect of causing pain for people who have existing data with these characters in it. I don't currently have a good workaround other than renaming the sequence (or using an old version of IGV from before we added that check).

lbergelson avatar Aug 05 '20 19:08 lbergelson

If it's a BAM, could one just replace the header with one that has better names? The BAM references the sequences by index, so theoretically, if one puts in a header with the sequences in the right order, it should "just work"... no?

yfarjoun avatar Aug 05 '20 19:08 yfarjoun

Yesish. I think there might be complications if you have things like SA tags which are text tags that reference the contig names.
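
A rough Python sketch of the header-replacement idea, under several assumptions: the header has been dumped to text (for example with `samtools view -H`), disallowed characters are simply replaced with an underscore, and only the `SN:` field of `@SQ` lines is touched. The function names `sanitize_name` and `sanitize_header` are hypothetical. As noted above, text fields in the records that mention contig names, such as SA tags, are not updated by this approach.

```python
import re

# Any character outside the set accepted by the regex in the error message.
DISALLOWED = re.compile(r"[^0-9A-Za-z!#$%&*+./:;=?@^_|~-]")


def sanitize_name(name, replacement="_"):
    """Replace disallowed characters in one sequence name (hypothetical helper)."""
    name = DISALLOWED.sub(replacement, name)
    # '*' and '=' are accepted by the pattern except as the first character.
    if name[:1] in ("*", "="):
        name = replacement + name[1:]
    return name


def sanitize_header(header_text, replacement="_"):
    """Rewrite SN: fields on @SQ lines, leaving line order and other lines as-is."""
    out = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            fields = [
                "SN:" + sanitize_name(f[3:], replacement) if f.startswith("SN:") else f
                for f in line.split("\t")
            ]
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

The rewritten header could then be applied with something like `samtools reheader new_header.sam in.bam > out.bam`, followed by re-indexing. Because the order of the `@SQ` lines is preserved, the numeric reference indices stored in the records should still point at the same sequences.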

lbergelson avatar Aug 05 '20 20:08 lbergelson

That's unfortunate! But at least the file will be viewable in IGV...

yfarjoun avatar Aug 05 '20 20:08 yfarjoun

I ran into the same problem. In my case, the BAM files come from alignment with STAR, against a genome index that was also generated by STAR.

The STAR manual suggests that chrName.txt (containing sequence names) in the genome index directory can be changed (as long as the order of the sequences is preserved), and the sequence names in this file would be used for output file formats.

In a similar manner, is it possible that some options at an earlier stage of your pipeline can be changed to modify the sequence names? Hope this helps. (I understand that my solution is very case-specific and is not even related to samtools...)
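
If renaming is done upstream like this, one thing worth guarding against is two different names collapsing into the same cleaned-up name. Below is a small Python sketch of an order-preserving rename with that check. It assumes the names are available as a list in their original order (for instance one name per line read from chrName.txt, which is an assumption about that file's layout), and `rename_in_order` is a made-up helper, not part of STAR or samtools.

```python
import re

# Any character outside the set accepted by the regex in the error message.
DISALLOWED = re.compile(r"[^0-9A-Za-z!#$%&*+./:;=?@^_|~-]")


def rename_in_order(names, replacement="_"):
    """Return cleaned names in the same order, failing on collisions (hypothetical helper)."""
    renamed = []
    seen = {}  # cleaned name -> original name that produced it
    for original in names:
        new = DISALLOWED.sub(replacement, original)
        if new in seen and seen[new] != original:
            raise ValueError(
                f"{original!r} and {seen[new]!r} both become {new!r}"
            )
        seen[new] = original
        renamed.append(new)
    return renamed


print(rename_in_order(["chr1", "a,b", "c(d)"]))
```

Keeping the output in the same order as the input matters here, since the sequences are matched up by position rather than by name.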

harris-yh-wong avatar Nov 11 '20 09:11 harris-yh-wong

IGV_2.3.80 working good!

lj365146534 avatar Apr 10 '23 07:04 lj365146534