noodles icon indicating copy to clipboard operation
noodles copied to clipboard

bam/writer, sam/writer: Validate cigar/sequence/basequalities

Open veldsla opened this issue 2 years ago • 4 comments

To avoid writing corrupt or incompatible SAM/BAM files an extra validation step is applied before writing. This makes sure:

  • The number of base qualities is equal to the sequence length or empty
  • The length of the cigar operations matches the sequence length

Closes #59

veldsla avatar Nov 24 '21 13:11 veldsla

This probably needs renaming or moving of stuff. Let me know what you think!

I forgot to edit the async writers. Will update tomorrow!

veldsla avatar Nov 24 '21 13:11 veldsla

Sorry, this is a bit too out of scope of what was initially suggested in #59. I expected the same logic used in the (BAM) SAM record writer to be ported to the BAM record writer. Can you scale it back to just that?

zaeleus avatar Nov 25 '21 00:11 zaeleus

I agree the file corruption issue is a different problem from writing sam/bam files that do not conform to the specification. Unfortunately the record writing code is duplicated 6 times. This way the validation is only written twice.

Fixing only the length/corruption issue does not magically make the incompatibility issue go away.

veldsla avatar Nov 25 '21 09:11 veldsla

Apparently using this branch on real world bam files triggers errors, but not in the test suite. Will investigate, converted PR to WIP.

veldsla avatar Nov 26 '21 15:11 veldsla

Thanks for looking at this in the past. I'm closing this since the alignment parsers/writers have now diverged greatly and makes the same checks.

zaeleus avatar Apr 27 '23 20:04 zaeleus