gatk
gatk copied to clipboard
BQSR can produce tables with missing ReadGroups
BaseRecalibrator can create a recalibration table which doesn't contain all the readgroups in the input sample. Then when ApplyBQSR is run on the same file with that table it crashes with an exception.
ex:
java.lang.IllegalStateException: The covariates table is missing ReadGroup GROUP_1 in RecalTable0
This can happen because BaseRecalibrator used a stricter set of filters than ApplyBQSR. If all the reads in a read group are filtered in BQSR then that read group will be missing from the recalibration table and fail when running ApplyBQSR. A possible example is a readgroup that only has duplicates or secondary reads.
https://gatkforums.broadinstitute.org/gatk/discussion/11546/gatk4-error-java-lang-illegalstateexception-the-covariates-table-is-missing-readgroup is a possible example of this error.
Is the issue fixed? I ran into the same error and after checking out GATK from GitHub and with the latest build (commit 031c40773d2aefd289005319d6880dbeb3c1dec9, gatk-4.1.4.1-83-g031c407-SNAPSHOT), the problem persisted.
@shengzha Not yet, though I am working on it (this ticket will transition to the closed state when a fix has been merged). In the meantime if you have a small test case that illustrates the problem, please add it here, as that would help ensure the fix addresses the specific issue you're seeing. Thanks.
@shengzha Not yet, though I am working on it (this ticket will transition to the closed state when a fix has been merged). In the meantime if you have a small test case that illustrates the problem, please add it here, as that would help ensure the fix addresses the specific issue you're seeing. Thanks.
@cmnbroad Thank you so much for the reply. I don't have a small test case for you, but I can provide some other information.
It is RNA seq data and passes validation check (java -jar picard.jar ValidateSamFile I=S3_2.unmapped.split.bam MODE=SUMMARY).
BaseRecalibrator cmd:
gatk BaseRecalibrator -R Homo_sapiens_assembly38.fasta -I S3_2.unmapped.split.bam --use-original-qualities -O S3_2.unmapped.recal_data.csv -known-sites Homo_sapiens_assembly38.dbsnp138.vcf -known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz
ApplyBQSR cmd:
gatk ApplyBQSR -R Homo_sapiens_assembly38.fasta -I S3_2.unmapped.split.bam -O S3_2.unmapped.aligned.duplicates_marked.recalibrated.bam -bqsr S3_2.unmapped.recal_data.csv --add-output-sam-program-record --use-original-qualities
RecalTables in S3_2.unmapped.recal_data.csv are empty.
Here is the screen dump of BaseRecalibrator and ApplyBQSR. BaseRecalibrator
Using GATK jar <XXX>/gatk-4.1.4.1-83-g031c407-SNAPSHOT/gatk-package-4.1.4.1-83-g031c407-SNAPSHOT-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar <XXX>/gatk-4.1.4.1-83-g031c407-SNAPSHOT/gatk-package-4.1.4.1-83-g031c407-SNAPSHOT-local.jar BaseRecalibrator -R Homo_sapiens_assembly38.fasta -I S3_2.unmapped.split.bam --use-original-qualities -O S3_2.unmapped.recal_data.csv -known-sites Homo_sapiens_assembly38.dbsnp138.vcf -known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz
23:39:34.668 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:<XXX>/gatk-4.1.4.1-83-g031c407-SNAPSHOT/gatk-package-4.1.4.1-83-g031c407-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
Feb 26, 2020 11:39:34 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
23:39:34.915 INFO BaseRecalibrator - ------------------------------------------------------------
23:39:34.915 INFO BaseRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.4.1-83-g031c407-SNAPSHOT
23:39:34.915 INFO BaseRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
23:39:34.915 INFO BaseRecalibrator - Executing as <XXX@XXX> on Linux v3.10.0-957.12.1.el7.x86_64 amd64
23:39:34.915 INFO BaseRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-b08
23:39:34.916 INFO BaseRecalibrator - Start Date/Time: February 26, 2020 11:39:34 PM EST
23:39:34.916 INFO BaseRecalibrator - ------------------------------------------------------------
23:39:34.916 INFO BaseRecalibrator - ------------------------------------------------------------
23:39:34.916 INFO BaseRecalibrator - HTSJDK Version: 2.21.2
23:39:34.916 INFO BaseRecalibrator - Picard Version: 2.21.9
23:39:34.916 INFO BaseRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:39:34.916 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:39:34.916 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:39:34.916 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:39:34.916 INFO BaseRecalibrator - Deflater: IntelDeflater
23:39:34.917 INFO BaseRecalibrator - Inflater: IntelInflater
23:39:34.917 INFO BaseRecalibrator - GCS max retries/reopens: 20
23:39:34.917 INFO BaseRecalibrator - Requester pays: disabled
23:39:34.917 INFO BaseRecalibrator - Initializing engine
23:39:35.461 INFO FeatureManager - Using codec VCFCodec to read file file://<XXX>/Homo_sapiens_assembly38.dbsnp138.vcf
23:39:35.595 INFO FeatureManager - Using codec VCFCodec to read file file://<XXX>/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
23:39:35.721 INFO FeatureManager - Using codec VCFCodec to read file file://<XXX>/Homo_sapiens_assembly38.known_indels.vcf.gz
23:39:35.896 INFO BaseRecalibrator - Done initializing engine
23:39:35.913 INFO BaseRecalibrationEngine - The covariates being used here:
23:39:35.913 INFO BaseRecalibrationEngine - ReadGroupCovariate
23:39:35.913 INFO BaseRecalibrationEngine - QualityScoreCovariate
23:39:35.913 INFO BaseRecalibrationEngine - ContextCovariate
23:39:35.914 INFO BaseRecalibrationEngine - CycleCovariate
23:39:35.965 INFO ProgressMeter - Starting traversal
23:39:35.965 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
23:40:38.767 INFO BaseRecalibrator - 62393454 read(s) filtered by: MappingQualityNotZeroReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: WellformedReadFilter
62393454 total reads filtered
23:40:38.769 INFO ProgressMeter - unmapped 1.0 0 0.0
23:40:38.769 INFO ProgressMeter - Traversal complete. Processed 0 total reads in 1.0 minutes.
23:40:38.770 INFO BaseRecalibrator - Calculating quantized quality scores...
23:40:38.784 INFO BaseRecalibrator - Writing recalibration report...
23:40:38.821 INFO BaseRecalibrator - ...done!
23:40:38.822 INFO BaseRecalibrator - BaseRecalibrator was able to recalibrate 0 reads
23:40:38.822 INFO BaseRecalibrator - Shutting down engine
[February 26, 2020 11:40:38 PM EST] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 1.07 minutes.
Runtime.totalMemory()=3622305792
Tool returned:
SUCCESS
ApplyBQSR
23:41:16.951 INFO ApplyBQSR - The Genome Analysis Toolkit (GATK) v4.1.4.1-83-g031c407-SNAPSHOT
23:41:16.952 INFO ApplyBQSR - For support and documentation go to https://software.broadinstitute.org/gatk/
23:41:16.952 INFO ApplyBQSR - Executing as <XXX@XXX> on Linux v3.10.0-957.12.1.el7.x86_64 amd64
23:41:16.952 INFO ApplyBQSR - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-b08
23:41:16.952 INFO ApplyBQSR - Start Date/Time: February 26, 2020 11:41:16 PM EST
23:41:16.952 INFO ApplyBQSR - ------------------------------------------------------------
23:41:16.952 INFO ApplyBQSR - ------------------------------------------------------------
23:41:16.953 INFO ApplyBQSR - HTSJDK Version: 2.21.2
23:41:16.953 INFO ApplyBQSR - Picard Version: 2.21.9
23:41:16.953 INFO ApplyBQSR - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:41:16.953 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:41:16.953 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:41:16.953 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:41:16.953 INFO ApplyBQSR - Deflater: IntelDeflater
23:41:16.953 INFO ApplyBQSR - Inflater: IntelInflater
23:41:16.953 INFO ApplyBQSR - GCS max retries/reopens: 20
23:41:16.953 INFO ApplyBQSR - Requester pays: disabled
23:41:16.953 INFO ApplyBQSR - Initializing engine
23:41:17.460 INFO ApplyBQSR - Done initializing engine
23:41:17.527 INFO ProgressMeter - Starting traversal
23:41:17.527 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
23:41:17.849 INFO ApplyBQSR - Shutting down engine
[February 26, 2020 11:41:17 PM EST] org.broadinstitute.hellbender.tools.walkers.bqsr.ApplyBQSR done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2411724800
java.lang.IllegalStateException: The covariates table is missing ReadGroup S3_2 in RecalTable0
at org.broadinstitute.hellbender.utils.Utils.validate(Utils.java:752)
at org.broadinstitute.hellbender.utils.recalibration.covariates.ReadGroupCovariate.keyForReadGroup(ReadGroupCovariate.java:81)
at org.broadinstitute.hellbender.utils.recalibration.covariates.ReadGroupCovariate.recordValues(ReadGroupCovariate.java:53)
at org.broadinstitute.hellbender.utils.recalibration.covariates.StandardCovariateList.recordAllValuesInStorage(StandardCovariateList.java:133)
at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.computeCovariates(RecalUtils.java:546)
at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.computeCovariates(RecalUtils.java:527)
at org.broadinstitute.hellbender.transformers.BQSRReadTransformer.apply(BQSRReadTransformer.java:145)
at org.broadinstitute.hellbender.transformers.BQSRReadTransformer.apply(BQSRReadTransformer.java:27)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:94)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
@cmnbroad And here is the output of BaseRecalibrator. S3_2.unmapped.recal_data.csv.zip
@shengzha In your case it looks like ApplyBQSR is failing because ALL of your input reads are being filtered out before BQSR ever sees them, due to having mapping quality 0:
23:40:38.767 INFO BaseRecalibrator - 62393454 read(s) filtered by: MappingQualityNotZeroReadFilter
...
23:40:38.822 INFO BaseRecalibrator - BaseRecalibrator was able to recalibrate 0 reads
@cmnbroad Thanks, I didn't notice that. I will check and get back to you later.
The root cause might be irrelevant but I am still posting it in case someone is in a similar situation. I was running the GATK workflow https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels on a local machine and it turned out that STAR alignment
STAR \
--genomeDir STAR2_5 \
--runThreadN ${threads} \
--readFilesIn ${fastq1} ${fastq2} \
--readFilesCommand "gunzip -c" \
${"--sjdbOverhang "+(read_length-1)} \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \
--limitBAMsortRAM ${star_mem+"000000000"} \
--limitOutSJcollapsed ${default=1000000 star_limitOutSJcollapsed} \
--outFileNamePrefix ${base_name}.
was not working fine because --readFilesCommand "gunzip -c" is not supported locally.
https://groups.google.com/forum/#!topic/rna-star/EkTo556lOMk this post helps.
@shengzha In your case it looks like ApplyBQSR is failing because ALL of your input reads are being filtered out before BQSR ever sees them, due to having mapping quality 0:
23:40:38.767 INFO BaseRecalibrator - 62393454 read(s) filtered by: MappingQualityNotZeroReadFilter...23:40:38.822 INFO BaseRecalibrator - BaseRecalibrator was able to recalibrate 0 reads
Hi, I also encounted the trable. And it looks like ApplyBQSR is failing is truelly caused by 0 reads after filtered in BQSR processing. I solved my problem in BQSR to generate recal_table.
So maybe my experience is helpful to you or others.
Q: "on the ApplyBQSR step with an error stating "The covariates table is missing ReadGroup xxx in RecalTable0"
A: check up reads remained in BQSR, such as filtering by regions: --intervals xxx.bed, and no reads in target regions.,and other filtering situations.
I still have the same issues, what can we do to get the Number of read groups must be >= 1. Thanks
I have the same issues,but finaly I found that I made a mistake about
bwa -R "@RG\tID:${x} \tPL:Illumina\tLB:${x}\tSM:${x}\tPG:bwa"
There is a space after the ID.
in fact ,result of gatk BaseRecalibrator Contains the space.
but when gatk ApplyBQSR read the table,
it will trim trailing spaces.
different names cause errors(I guess)
Resolved by #9020.