
Relax reference name validation with ValidationStringency

Open maxibor opened this issue 5 years ago • 6 comments

Right now, the (default?) reference name validation stringency of htsjdk is pretty strict, leading to errors when reference names in alignment files are ill-formatted (for example, the reference names in the metaphlan database). This should be relaxed with ValidationStringency to allow for imperfectly formatted reference names. CC @apeltzer @JudithNeukamm

maxibor avatar Jun 28 '20 10:06 maxibor

Thanks for your comment. The ValidationStringency is set to LENIENT by default, which emits warnings but keeps the run going if possible. Did you get a wrong output, or did the tool just throw a warning?

JudithNeukamm avatar Jun 29 '20 07:06 JudithNeukamm

$ damageprofiler -i metagenomebis.all_mapped.bam -r mpa_db_latest.fa -o damageprofiler
DamageProfiler v0.4.6
Invalid SAM/BAM file. Please check your file.
htsjdk.samtools.SAMException: Sequence name '157592__A0A150IGK6__fliD,flbC,flaV' doesn't match regex: '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*'

Version: 0.4.6, installed via Conda. Test files are unfortunately too big to attach. The error comes from characters in the sequence name that the regex does not allow (in the example above, a comma).

maxibor avatar Jun 29 '20 12:06 maxibor

I made a small test file, and changing the ValidationStringency to 'SILENT' does not solve the problem, unfortunately. I will try to solve this by the next release.

JudithNeukamm avatar Jun 29 '20 15:06 JudithNeukamm

To be honest, that's also quite an invalid FastA header 🙄 157592__A0A150IGK6__fliD,flbC,flaV 🤦

apeltzer avatar Jun 29 '20 15:06 apeltzer

Agree, Metaphlan uses funny reference names. Though, for example, this is valid: 157592__A0A150IGK6__fliD;flbC;flaV

maxibor avatar Jul 06 '20 10:07 maxibor
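
For anyone who wants to check their own reference names before running the tool, here is a minimal Python sketch. The pattern is copied from the htsjdk error message above; that the whole name has to match it is an assumption, and the function names are hypothetical helpers, not part of DamageProfiler or htsjdk.

```python
import re

# Character classes copied from the regex in the htsjdk error message.
# The first character has a slightly stricter class (no '*' and no '=').
FIRST_CHAR = re.compile(r"[0-9A-Za-z!#$%&+./:;?@^_|~-]")
OTHER_CHAR = re.compile(r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]")
SEQUENCE_NAME = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_sequence_name(name):
    """Return True if the whole name matches the pattern (assumed full match)."""
    return SEQUENCE_NAME.fullmatch(name) is not None


def invalid_characters(name):
    """List (position, character) pairs that the pattern does not allow."""
    bad = []
    for i, ch in enumerate(name):
        allowed = FIRST_CHAR if i == 0 else OTHER_CHAR
        if not allowed.fullmatch(ch):
            bad.append((i, ch))
    return bad


if __name__ == "__main__":
    for name in (
        "157592__A0A150IGK6__fliD,flbC,flaV",
        "157592__A0A150IGK6__fliD;flbC;flaV",
    ):
        print(name, is_valid_sequence_name(name), invalid_characters(name))
```

With the two names from this thread, the comma-separated one is rejected because of its commas, while the semicolon-separated one passes.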

Unfortunately, I couldn't solve this problem. It's not influenced by the ValidationStringency parameter, and there doesn't seem to be an option to set a user-defined regex pattern. It might be an option to contact the developer of metaphlan to make them aware of this issue? Or to fix the file before running DamageProfiler. If you find a way to solve this within the code, I'll be happy to include it.

JudithNeukamm avatar Aug 05 '20 10:08 JudithNeukamm
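
As a rough illustration of the "fix the file before running DamageProfiler" route, the Python sketch below rewrites reference names so they only contain characters from the regex in the error message. It is a workaround under stated assumptions, not something DamageProfiler provides: the function names are hypothetical, the `_` replacement character is an arbitrary choice, and the same renaming has to be applied to both the FASTA headers and the `@SQ` lines of the alignment header so the two stay consistent.

```python
import re

# Negated versions of the character classes from the htsjdk error message.
_DISALLOWED_ANYWHERE = re.compile(r"[^0-9A-Za-z!#$%&*+./:;=?@^_|~-]")
_DISALLOWED_FIRST = re.compile(r"[^0-9A-Za-z!#$%&+./:;?@^_|~-]")


def sanitize_name(name, replacement="_"):
    """Replace every character the pattern rejects (replacement is an assumption)."""
    cleaned = _DISALLOWED_ANYWHERE.sub(replacement, name)
    # The first character may be neither '*' nor '=', even though later ones may.
    if cleaned and _DISALLOWED_FIRST.match(cleaned[0]):
        cleaned = replacement + cleaned[1:]
    return cleaned


def sanitize_fasta_header(line, replacement="_"):
    """Sanitize the name (first token) of a FASTA header; other lines pass through."""
    if not line.startswith(">"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    body = line[1:].rstrip("\n")
    name, rest = re.match(r"(\S*)(.*)", body, re.S).groups()
    return ">" + sanitize_name(name, replacement) + rest + newline


def sanitize_sam_header_line(line, replacement="_"):
    """Sanitize the SN: field of an @SQ header line; other lines pass through."""
    if not line.startswith("@SQ"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    fields = [
        "SN:" + sanitize_name(f[3:], replacement) if f.startswith("SN:") else f
        for f in fields
    ]
    return "\t".join(fields) + newline


if __name__ == "__main__":
    print(sanitize_fasta_header(">157592__A0A150IGK6__fliD,flbC,flaV\n"), end="")
    print(sanitize_sam_header_line("@SQ\tSN:157592__A0A150IGK6__fliD,flbC,flaV\tLN:100\n"), end="")
```

One way to use this would be to dump the BAM header as text, pass each line through `sanitize_sam_header_line`, and write the result back with `samtools reheader`, while passing the reference FASTA through `sanitize_fasta_header`. Two distinct names could in principle collapse to the same sanitized name, so it is worth checking for duplicates afterwards.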