Differential header stringency depending on file format

Open ernfrid opened this issue 5 years ago • 4 comments

I'm observing differential error checking depending on whether the input file is a CRAM or a BAM. The files in question have multiple sample names listed (a different one for each @RG line). When the input is a CRAM, no error is thrown. When the input is a BAM, I see: panic: bam reagroup: more than one RG for /build/test.bam

At the moment, it seems that indexcov doesn't check CRAM headers; see https://github.com/brentp/goleft/blob/master/indexcov/indexcov.go#L202-L231

I assume this error is thrown because the assumption is that there is a single sample for the whole file and there isn't handling of multiple samples. What is being reported when these problem CRAMs are provided? Stats for all the samples pooled together?

ernfrid avatar Nov 14 '18 22:11 ernfrid

For context, I'm seeing this error when running Smoove.

ernfrid avatar Nov 14 '18 22:11 ernfrid

Yes, for CRAM it will report the sum of all samples. It should be checked in CRAM too. The index can't know about the different samples.

brentp avatar Nov 14 '18 22:11 brentp
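
For illustration, here is a minimal sketch of a format-independent check, assuming the header text has already been read from either a BAM or a CRAM. It is not goleft's code: `sampleNames` and `checkSingleSample` are hypothetical helpers, and the header string is made up. It only collects the distinct `SM` values from `@RG` lines and reports when there is more than one.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sampleNames (hypothetical helper) returns the distinct SM values found on
// @RG lines of SAM-style header text, sorted for stable output.
func sampleNames(header string) []string {
	seen := map[string]bool{}
	for _, line := range strings.Split(header, "\n") {
		line = strings.TrimRight(line, "\r")
		if !strings.HasPrefix(line, "@RG\t") {
			continue
		}
		// Skip the "@RG" record type and scan the TAG:VALUE fields.
		for _, field := range strings.Split(line, "\t")[1:] {
			if strings.HasPrefix(field, "SM:") {
				seen[strings.TrimPrefix(field, "SM:")] = true
			}
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// checkSingleSample (hypothetical helper) returns an error when the header
// names more than one sample, instead of panicking or silently pooling.
func checkSingleSample(path, header string) error {
	names := sampleNames(header)
	if len(names) > 1 {
		return fmt.Errorf("%s: %d sample names in @RG lines: %s",
			path, len(names), strings.Join(names, ", "))
	}
	return nil
}

func main() {
	// Made-up header with two read groups carrying different sample names.
	header := "@HD\tVN:1.6\tSO:coordinate\n" +
		"@RG\tID:rg1\tSM:sampleA\n" +
		"@RG\tID:rg2\tSM:sampleB\n"

	// The same check runs regardless of which file format supplied the header.
	for _, path := range []string{"test.bam", "test.cram"} {
		if err := checkSingleSample(path, header); err != nil {
			fmt.Println("error:", err)
		}
	}
}
```

Returning an error here, as opposed to panicking, would let a caller decide whether mislabeled read groups should abort the run or only produce a warning.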

smoove won't work with multiple samples per bam (I don't think lumpy will either).

brentp avatar Nov 14 '18 22:11 brentp

Yeah, that's what I'd expect. They're not actually multiple samples though...just mislabeled single samples. If by happy circumstance it was all erroneously analyzed as a single sample, then that may not be the worst thing...

ernfrid avatar Nov 14 '18 22:11 ernfrid