gatk
gatk copied to clipboard
FastaAlternateReferenceMaker outputs empty fasta files for select genes from select VCF files
I am attempting to create fasta files from specific regions in approximately 270 VCF files. For every other region/gene I've looked at, I have not had this issue. For one particular region (mrr1), I am getting the error seen below. I checked the coverage of the bam file and viewed the vcf in IGV viewer, but notice no problems. Can you please advise? Thank you.
Similar to this issue, but am still not sure how to approach it? https://github.com/broadinstitute/gatk/issues/6260#issue-521418442
Bash script:
#!/bin/bash --login
#SBATCH --time=1:00:00 # limit of wall clock time - how long the job will run
#SBATCH --ntasks=1 # number of tasks - how many tasks (nodes) that you requir
#SBATCH --cpus-per-task=1 # number of CPUs (or cores) per task (same as -c)
#SBATCH --mem=50G # memory required per node - amount of memory (in bytes)
#SBATCH --job-name=VCF_FastaNEP_CCR
#SBATCH [email protected]
#SBATCH --mail-type=ALL
#SBATCH -o SpeciesID_CCR7_slurm
cd /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/
module load Java/JDK12
for sample in AI7 W18 B5 BU9 I9 R23 Y1
do
base=$(basename ${sample})
gatk-4.2.5.0/gatk SelectVariants -R /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/0_DNAscripts/ReferenceGenome/Botrytis_cinerea.ASM83294v1.dna.toplevel.fa -V /mn
t/research/Hausbeck_group/Lukasko/BotrytisDNASeq/10_FilteredVCF/Plates123/BcinereaP123.SNVonly.filteredPASS_renamed.vcf -sn ${sample} --remove-unused-alternates --exclu
de-sample-name /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/ConservedGenes/ExcludeList.args -O /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/Cons
ervedGenes/VCFs/${base}.vcf
gatk-4.2.5.0/gatk FastaAlternateReferenceMaker -R /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/0_DNAscripts/ReferenceGenome/Botrytis_cinerea.ASM83294v1.dna.tople
vel.fa -O /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/ConservedGenes/mrr1/${base}_mrr1.fasta -L 5:680219-684662 -V /mnt/research/Hausbeck_group/Lukasko
/BotrytisDNASeq/CCR7/ConservedGenes/VCFs/${base}.vcf
done
#< /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/10_FilteredVCF/Plates123/Newnames123SNVINDEL.txt
scontrol show job $SLURM_JOB_ID
Slurm:
15:33:06.612 INFO FastaAlternateReferenceMaker - ------------------------------------------------------------
15:33:06.612 INFO FastaAlternateReferenceMaker - The Genome Analysis Toolkit (GATK) v4.2.5.0
15:33:06.613 INFO FastaAlternateReferenceMaker - For support and documentation go to https://software.broadinstitute.org/gatk/
15:33:06.613 INFO FastaAlternateReferenceMaker - Executing as lukaskon@lac-307 on Linux v3.10.0-1160.80.1.el7.x86_64 amd64
15:33:06.613 INFO FastaAlternateReferenceMaker - Java runtime: Java HotSpot(TM) 64-Bit Server VM v12+33
15:33:06.614 INFO FastaAlternateReferenceMaker - Start Date/Time: July 18, 2023 at 3:33:06 PM EDT
15:33:06.614 INFO FastaAlternateReferenceMaker - ------------------------------------------------------------
15:33:06.614 INFO FastaAlternateReferenceMaker - ------------------------------------------------------------
15:33:06.615 INFO FastaAlternateReferenceMaker - HTSJDK Version: 2.24.1
15:33:06.616 INFO FastaAlternateReferenceMaker - Picard Version: 2.25.4
15:33:06.616 INFO FastaAlternateReferenceMaker - Built for Spark Version: 2.4.5
15:33:06.616 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:33:06.617 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:33:06.617 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:33:06.617 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:33:06.618 INFO FastaAlternateReferenceMaker - Deflater: IntelDeflater
15:33:06.618 INFO FastaAlternateReferenceMaker - Inflater: IntelInflater
15:33:06.618 INFO FastaAlternateReferenceMaker - GCS max retries/reopens: 20
15:33:06.619 INFO FastaAlternateReferenceMaker - Requester pays: disabled
15:33:06.619 INFO FastaAlternateReferenceMaker - Initializing engine
15:33:06.870 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/ConservedGenes/VCFs/AI7.vcf
15:33:06.936 INFO IntervalArgumentCollection - Processing 4444 bp from intervals
15:33:06.939 INFO FastaAlternateReferenceMaker - Done initializing engine
15:33:06.949 INFO ProgressMeter - Starting traversal
15:33:06.949 INFO ProgressMeter - Current Locus Elapsed Minutes Bases Processed Bases/Minute
15:33:07.194 INFO FastaAlternateReferenceMaker - Shutting down engine
[July 18, 2023 at 3:33:07 PM EDT] org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2076049408
java.lang.IllegalArgumentException: Illegal base [ ] seen in the allele
at htsjdk.variant.variantcontext.Allele.create(Allele.java:251)
at htsjdk.variant.variantcontext.Allele.create(Allele.java:402)
at org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker.lambda$handlePosition$0(FastaAlternateReferenceMaker.java:176)
at java.base/java.util.Optional.orElseGet(Optional.java:369)
at org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker.handlePosition(FastaAlternateReferenceMaker.java:176)
at org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker.apply(FastaAlternateReferenceMaker.java:141)
at org.broadinstitute.hellbender.engine.ReferenceWalker.traverse(ReferenceWalker.java:55)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1085)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
@lukaskon Thanks for the report. This is almost certainly a bug on our end. It looks like it's happening at sites that are only covered by a spanning deletion. We're trying to create an allele internally with a space but code elsewhere is rejecting that. I suspect it used to work but we rewrote the allele handling code a while ago and didn't have a test for this particular edge case.
Thank you for the quick response! Do you know of another tool that can do the same (or very similar) task for the time being?
If it's helpful, I can provide a couple samples for you to test. The VCFs that worked/didnt work appear to be homologous in the region that I specified (according to IGV), so I not sure a deletion is causing the issue?
I'm sorry, I don't know another tool that does the same thing. If you could provide some samples that cause the problem that would actually be really helpful. It should be a simple fix so I could try to get a path out soon, but it's always faster with a test case.
Sure. If you provide your email, I can send a couple VCFs that worked/did not work along with the reference genome?
Thank you. You an send them to [email protected]. If the files are more than a few MB they might not email though. I'm assuming this non-human data and therefore not subject to human privacy rules?
Correct, these are fungal genomes. I just sent an email with a dropbox link, let me know if you need anything else. Thanks for looking at it!
Hello, just inquiring if any solution has been found for this issue? Thank you.
Hello, I seem to be seeing this same issue in trying to use FastaAlternateReferenceMaker. I was curious if any solutions have been found? Thank you.