gatk
gatk copied to clipboard
Misleading error message about multi-sample when no read groups are present
HaplotypeCaller checks for samplesList.numberOfSamples() != 1
. The idea was probably about detecting cases with > 1
, but when no @RG
present in the BAM file (i.e. == 0
), it throws the same error. Obviously, such files are not multi-sample, so ideally HaplotypeCaller should just treat them normally without any error. If it's not possible, at least make more informative error message indicating that the problem is having no read groups at all, not "multi-sample BAM file".
https://github.com/broadinstitute/gatk/blob/1353e3201bb11e29039efd89359b0a4cfc11e5c0/src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java#L279
Thanks for the report. That is confusing. We'll fix that.
@lbergelson am I reading the spec wrong? I thought @RG
was required?
Oh, I meant we should fix the error message so it accurately tells you what's wrong.
Got it. Then I guess I'm arguing against @aushev's expected behavior that
HaplotypeCaller should just treat them normally without any error
Laura if you don't mind - is @RG
really necessary for HaplotypeCaller's work? I understand that it is defined in the specs but is it critical for the functioning, if we assume that each sample is one library?
In my case, I was following a pipeline involving STAR
mapping -> MarkDuplicates
-> SplitNCigarReads
-> HaplotypeCaller
. As I understand, normally I should have added @RG
s at the mapping step. Neither MarkDuplicates
nor SplitNCigarReads
gave any warning about it, so I only figured the problem at the final step. I have then added totally fake @RG
s to my files just after SplitNCigarReads step, and it worked - then why HaplotypeCaller can't do something similar in case when RGs are not present?
Thanks!
Got it. Then I guess I'm arguing against @aushev's expected behavior that
HaplotypeCaller should just treat them normally without any error
I really don't want to assume that the data is correct because in the event that a user does try to run GVCF mode on a multi-sample bam without an @RG header line then their results are going to be terrible. It's unlikely, but it would be very bad and potentially not obvious to debug.
got it, thanks for explaining!
Well then - as a feature request for future versions - do you think it would be possible to add the check to other units like MarkDuplicates or SplitNCigarReads so that they give a warning when there's no @RG
in the file?
I opened #6511 for SplitNCigarReads and https://github.com/broadinstitute/picard/issues/1488 for MarkDuplicates. They should both be easy to address once we decide on warning or error.
Sounds good, thank you!
From discussion over in the Picard repo, I've realized that I read the spec wrong. In that case, I'm willing to let HaplotypeCaller "power through" after emitting a warning if there is no read group header.
I thought this was one of the things we enforced in gatk despite knowing it's not in the spec? We might need to update a bunch of stuff if we allow for no readgroups. Things like bqsr covariates might start getting weird nulls.
gatk-4.4.0.0 out there and still, gatk HaplotypeCaller --sample-name 7-Wu-FF ...
bails on me with:
A USER ERROR has occurred: Argument --sample_name has a bad value: Specified name does not exist in input bam files
Indeed, there is not @RG
line at all, how how about adding an extra error/warning message somewhere more above in the code path?
$ samtools view -H 7-Wu-FF.sorted.bam
@HD VN:1.6 SO:coordinate
@SQ SN:7-Wu-FF LN:443
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 32 -o exact_matches/7-Wu-FF.sam 7-Wu-FF.fasta 7-Wu-FF_R1.fastq.gz 7-Wu-FF_R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.17 CL:samtools sort 7-Wu-FF.sorted.bam 7-Wu-FF.sam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.17 CL:samtools view -H 7-Wu-FF.sorted.bam
$
Like others I agree that if there are no samples found by GATK then all data should be used.
I also think that @SQ SN:
is enough for a sanity check. No need for @RG SN:
tag.
Finally, fix the --sample_name
with --sample-name
in the error message output, the old syntax with an underscore is not accepted anymore.