gatk icon indicating copy to clipboard operation
gatk copied to clipboard

Misleading error message about multi-sample when no read groups are present

Open aushev opened this issue 4 years ago • 12 comments

HaplotypeCaller checks for samplesList.numberOfSamples() != 1. The idea was probably about detecting cases with > 1, but when no @RG present in the BAM file (i.e. == 0), it throws the same error. Obviously, such files are not multi-sample, so ideally HaplotypeCaller should just treat them normally without any error. If it's not possible, at least make more informative error message indicating that the problem is having no read groups at all, not "multi-sample BAM file".

https://github.com/broadinstitute/gatk/blob/1353e3201bb11e29039efd89359b0a4cfc11e5c0/src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java#L279

aushev avatar Mar 13 '20 03:03 aushev

Thanks for the report. That is confusing. We'll fix that.

lbergelson avatar Mar 16 '20 01:03 lbergelson

@lbergelson am I reading the spec wrong? I thought @RG was required?

ldgauthier avatar Mar 16 '20 15:03 ldgauthier

Oh, I meant we should fix the error message so it accurately tells you what's wrong.

lbergelson avatar Mar 16 '20 15:03 lbergelson

Got it. Then I guess I'm arguing against @aushev's expected behavior that

HaplotypeCaller should just treat them normally without any error

ldgauthier avatar Mar 16 '20 16:03 ldgauthier

Laura if you don't mind - is @RG really necessary for HaplotypeCaller's work? I understand that it is defined in the specs but is it critical for the functioning, if we assume that each sample is one library? In my case, I was following a pipeline involving STAR mapping -> MarkDuplicates -> SplitNCigarReads -> HaplotypeCaller. As I understand, normally I should have added @RGs at the mapping step. Neither MarkDuplicates nor SplitNCigarReads gave any warning about it, so I only figured the problem at the final step. I have then added totally fake @RGs to my files just after SplitNCigarReads step, and it worked - then why HaplotypeCaller can't do something similar in case when RGs are not present? Thanks!

Got it. Then I guess I'm arguing against @aushev's expected behavior that

HaplotypeCaller should just treat them normally without any error

aushev avatar Mar 16 '20 19:03 aushev

I really don't want to assume that the data is correct because in the event that a user does try to run GVCF mode on a multi-sample bam without an @RG header line then their results are going to be terrible. It's unlikely, but it would be very bad and potentially not obvious to debug.

ldgauthier avatar Mar 18 '20 20:03 ldgauthier

got it, thanks for explaining! Well then - as a feature request for future versions - do you think it would be possible to add the check to other units like MarkDuplicates or SplitNCigarReads so that they give a warning when there's no @RG in the file?

aushev avatar Mar 18 '20 21:03 aushev

I opened #6511 for SplitNCigarReads and https://github.com/broadinstitute/picard/issues/1488 for MarkDuplicates. They should both be easy to address once we decide on warning or error.

ldgauthier avatar Mar 19 '20 20:03 ldgauthier

Sounds good, thank you!

aushev avatar Mar 19 '20 20:03 aushev

From discussion over in the Picard repo, I've realized that I read the spec wrong. In that case, I'm willing to let HaplotypeCaller "power through" after emitting a warning if there is no read group header.

ldgauthier avatar Mar 20 '20 18:03 ldgauthier

I thought this was one of the things we enforced in gatk despite knowing it's not in the spec? We might need to update a bunch of stuff if we allow for no readgroups. Things like bqsr covariates might start getting weird nulls.

lbergelson avatar Mar 23 '20 14:03 lbergelson

gatk-4.4.0.0 out there and still, gatk HaplotypeCaller --sample-name 7-Wu-FF ... bails on me with:

A USER ERROR has occurred: Argument --sample_name has a bad value: Specified name does not exist in input bam files

Indeed, there is not @RG line at all, how how about adding an extra error/warning message somewhere more above in the code path?

$ samtools view -H 7-Wu-FF.sorted.bam
@HD     VN:1.6  SO:coordinate
@SQ     SN:7-Wu-FF      LN:443
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 32 -o exact_matches/7-Wu-FF.sam 7-Wu-FF.fasta 7-Wu-FF_R1.fastq.gz 7-Wu-FF_R2.fastq.gz
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.17 CL:samtools sort 7-Wu-FF.sorted.bam 7-Wu-FF.sam
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.17 CL:samtools view -H 7-Wu-FF.sorted.bam
$

Like others I agree that if there are no samples found by GATK then all data should be used.

I also think that @SQ SN: is enough for a sanity check. No need for @RG SN: tag.

Finally, fix the --sample_name with --sample-name in the error message output, the old syntax with an underscore is not accepted anymore.

mmokrejs avatar Nov 29 '23 14:11 mmokrejs