DamageProfiler
Relax reference name validation with ValidationStringency
Right now, htsjdk's reference name validation (possibly at its default stringency) is quite strict, which leads to errors when reference names in alignment files are ill-formatted (for example, the reference names in the MetaPhlAn database). This should be relaxed with ValidationStringency to allow imperfectly formatted reference names. CC @apeltzer @JudithNeukamm
Thanks for your comment. The ValidationStringency is set to LENIENT by default which emits warnings but keeps the run going if possible. Did you get a wrong output or did the tool just throw a warning?
$ damageprofiler -i metagenomebis.all_mapped.bam -r mpa_db_latest.fa -o damageprofiler
DamageProfiler v0.4.6
Invalid SAM/BAM file. Please check your file.
htsjdk.samtools.SAMException: Sequence name '157592__A0A150IGK6__fliD,flbC,flaV' doesn't match regex: '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*'
Version: 0.4.6, installed via Conda
Test files are unfortunately too big to attach. The error comes from characters in the sequence name that the regex does not allow (in the example above, the comma).
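To make the failure easier to reproduce without the large test files, here is a minimal Python sketch that checks a name against the pattern quoted in the error message. The pattern is copied verbatim from the htsjdk output above; the helper names `is_valid_name` and `offending_chars` are made up for illustration and are not part of htsjdk or DamageProfiler.

```python
import re

# Character classes copied from the htsjdk error message above.
FIRST = r"[0-9A-Za-z!#$%&+./:;?@^_|~-]"      # allowed as first character
REST = r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]"     # allowed in the remaining positions
SEQ_NAME = re.compile(FIRST + REST + "*")


def is_valid_name(name):
    """Return True if the whole name matches the quoted pattern."""
    return SEQ_NAME.fullmatch(name) is not None


def offending_chars(name):
    """Return the distinct characters that the pattern rejects, in order of appearance."""
    bad = []
    for i, ch in enumerate(name):
        pattern = FIRST if i == 0 else REST
        if re.fullmatch(pattern, ch) is None and ch not in bad:
            bad.append(ch)
    return bad


if __name__ == "__main__":
    name = "157592__A0A150IGK6__fliD,flbC,flaV"
    print(is_valid_name(name), offending_chars(name))
```

Running this on the name from the error reports the comma as the only rejected character.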
I made a small test file, and changing the ValidationStringency to 'SILENT' unfortunately does not solve the problem. I will try to fix this by the next release.
To be honest, that's also quite an invalid FastA header 🙄 157592__A0A150IGK6__fliD,flbC,flaV 🤦
Agree, Metaphlan uses funny reference names. Though, for example, this is valid: 157592__A0A150IGK6__fliD;flbC;flaV
Unfortunately, I couldn't solve this problem. It's not influenced by the ValidationStringency parameter, and there doesn't seem to be an option to set a user-defined regex pattern. One option might be to contact the developers of MetaPhlAn to make them aware of this issue, or to fix the file before running DamageProfiler. If you find a way to solve this within the code, I'll be happy to include it.
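As a rough illustration of the "fix the file before running" workaround, the Python sketch below rewrites FASTA header names so they satisfy the pattern from the error message. Everything here is an assumption rather than an existing tool: the function names, the choice of `_` as the replacement character, and the file paths are hypothetical. It also only covers the reference FASTA; the sequence names in the BAM header would have to match the renamed reference (for example by remapping or rewriting the header), which is not shown.

```python
import re

# Anything outside the "remaining characters" class from the htsjdk message.
DISALLOWED = re.compile(r"[^0-9A-Za-z!#$%&*+./:;=?@^_|~-]")


def sanitize_name(name, replacement="_"):
    """Replace characters that the quoted pattern rejects (replacement is an assumption)."""
    cleaned = DISALLOWED.sub(replacement, name)
    # '*' and '=' are allowed later in the name but not as the first character.
    if cleaned and cleaned[0] in "*=":
        cleaned = replacement + cleaned[1:]
    return cleaned


def sanitize_fasta_lines(lines):
    """Yield FASTA lines with the sequence name (text up to the first whitespace) cleaned."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            name, rest = re.match(r"(\S*)(.*)", header, re.S).groups()
            yield ">" + sanitize_name(name) + rest + "\n"
        else:
            yield line


if __name__ == "__main__":
    # Hypothetical paths; adjust to the actual reference file.
    with open("mpa_db_latest.fa") as src, open("mpa_db_latest.clean.fa", "w") as dst:
        dst.writelines(sanitize_fasta_lines(src))
```

Note that replacing characters can make two previously distinct names identical, so it is worth checking the cleaned names for duplicates before using the file.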