gatk icon indicating copy to clipboard operation
gatk copied to clipboard

FastaAlternateReferenceMaker outputs empty fasta files for select genes from select VCF files

Open lukaskon opened this issue 2 years ago • 9 comments

I am attempting to create fasta files from specific regions in approximately 270 VCF files. For every other region/gene I've looked at, I have not had this issue. For one particular region (mrr1), I am getting the error seen below. I checked the coverage of the bam file and viewed the vcf in IGV viewer, but notice no problems. Can you please advise? Thank you.

Similar to this issue, but am still not sure how to approach it? https://github.com/broadinstitute/gatk/issues/6260#issue-521418442

Bash script:

#!/bin/bash --login
#SBATCH --time=1:00:00             # limit of wall clock time - how long the job will run
#SBATCH --ntasks=1                  # number of tasks - how many tasks (nodes) that you requir
#SBATCH --cpus-per-task=1           # number of CPUs (or cores) per task (same as -c)
#SBATCH --mem=50G                    # memory required per node - amount of memory (in bytes)
#SBATCH --job-name=VCF_FastaNEP_CCR
#SBATCH [email protected]
#SBATCH --mail-type=ALL
#SBATCH -o SpeciesID_CCR7_slurm


cd /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/

module load Java/JDK12

for sample in AI7 W18 B5 BU9 I9 R23 Y1
do
base=$(basename ${sample})

gatk-4.2.5.0/gatk SelectVariants -R /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/0_DNAscripts/ReferenceGenome/Botrytis_cinerea.ASM83294v1.dna.toplevel.fa -V /mn
t/research/Hausbeck_group/Lukasko/BotrytisDNASeq/10_FilteredVCF/Plates123/BcinereaP123.SNVonly.filteredPASS_renamed.vcf -sn ${sample} --remove-unused-alternates --exclu
de-sample-name /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/ConservedGenes/ExcludeList.args -O /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/Cons
ervedGenes/VCFs/${base}.vcf

gatk-4.2.5.0/gatk FastaAlternateReferenceMaker -R /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/0_DNAscripts/ReferenceGenome/Botrytis_cinerea.ASM83294v1.dna.tople
vel.fa -O /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/ConservedGenes/mrr1/${base}_mrr1.fasta -L 5:680219-684662 -V /mnt/research/Hausbeck_group/Lukasko
/BotrytisDNASeq/CCR7/ConservedGenes/VCFs/${base}.vcf


done
#< /mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/10_FilteredVCF/Plates123/Newnames123SNVINDEL.txt
scontrol show job $SLURM_JOB_ID

Slurm:

15:33:06.612 INFO  FastaAlternateReferenceMaker - ------------------------------------------------------------
15:33:06.612 INFO  FastaAlternateReferenceMaker - The Genome Analysis Toolkit (GATK) v4.2.5.0
15:33:06.613 INFO  FastaAlternateReferenceMaker - For support and documentation go to https://software.broadinstitute.org/gatk/
15:33:06.613 INFO  FastaAlternateReferenceMaker - Executing as lukaskon@lac-307 on Linux v3.10.0-1160.80.1.el7.x86_64 amd64
15:33:06.613 INFO  FastaAlternateReferenceMaker - Java runtime: Java HotSpot(TM) 64-Bit Server VM v12+33
15:33:06.614 INFO  FastaAlternateReferenceMaker - Start Date/Time: July 18, 2023 at 3:33:06 PM EDT
15:33:06.614 INFO  FastaAlternateReferenceMaker - ------------------------------------------------------------
15:33:06.614 INFO  FastaAlternateReferenceMaker - ------------------------------------------------------------
15:33:06.615 INFO  FastaAlternateReferenceMaker - HTSJDK Version: 2.24.1
15:33:06.616 INFO  FastaAlternateReferenceMaker - Picard Version: 2.25.4
15:33:06.616 INFO  FastaAlternateReferenceMaker - Built for Spark Version: 2.4.5
15:33:06.616 INFO  FastaAlternateReferenceMaker - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:33:06.617 INFO  FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:33:06.617 INFO  FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:33:06.617 INFO  FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:33:06.618 INFO  FastaAlternateReferenceMaker - Deflater: IntelDeflater
15:33:06.618 INFO  FastaAlternateReferenceMaker - Inflater: IntelInflater
15:33:06.618 INFO  FastaAlternateReferenceMaker - GCS max retries/reopens: 20
15:33:06.619 INFO  FastaAlternateReferenceMaker - Requester pays: disabled
15:33:06.619 INFO  FastaAlternateReferenceMaker - Initializing engine
15:33:06.870 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/research/Hausbeck_group/Lukasko/BotrytisDNASeq/CCR7/ConservedGenes/VCFs/AI7.vcf
15:33:06.936 INFO  IntervalArgumentCollection - Processing 4444 bp from intervals
15:33:06.939 INFO  FastaAlternateReferenceMaker - Done initializing engine
15:33:06.949 INFO  ProgressMeter - Starting traversal
15:33:06.949 INFO  ProgressMeter -        Current Locus  Elapsed Minutes       Bases Processed     Bases/Minute
15:33:07.194 INFO  FastaAlternateReferenceMaker - Shutting down engine
[July 18, 2023 at 3:33:07 PM EDT] org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2076049408
java.lang.IllegalArgumentException: Illegal base [ ] seen in the allele
        at htsjdk.variant.variantcontext.Allele.create(Allele.java:251)
        at htsjdk.variant.variantcontext.Allele.create(Allele.java:402)
        at org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker.lambda$handlePosition$0(FastaAlternateReferenceMaker.java:176)
        at java.base/java.util.Optional.orElseGet(Optional.java:369)
        at org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker.handlePosition(FastaAlternateReferenceMaker.java:176)
        at org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker.apply(FastaAlternateReferenceMaker.java:141)
        at org.broadinstitute.hellbender.engine.ReferenceWalker.traverse(ReferenceWalker.java:55)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1085)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

lukaskon avatar Jul 18 '23 21:07 lukaskon

@lukaskon Thanks for the report. This is almost certainly a bug on our end. It looks like it's happening at sites that are only covered by a spanning deletion. We're trying to create an allele internally with a space but code elsewhere is rejecting that. I suspect it used to work but we rewrote the allele handling code a while ago and didn't have a test for this particular edge case.

lbergelson avatar Jul 20 '23 16:07 lbergelson

Thank you for the quick response! Do you know of another tool that can do the same (or very similar) task for the time being?

lukaskon avatar Jul 20 '23 21:07 lukaskon

If it's helpful, I can provide a couple samples for you to test. The VCFs that worked/didnt work appear to be homologous in the region that I specified (according to IGV), so I not sure a deletion is causing the issue?

lukaskon avatar Jul 21 '23 20:07 lukaskon

I'm sorry, I don't know another tool that does the same thing. If you could provide some samples that cause the problem that would actually be really helpful. It should be a simple fix so I could try to get a path out soon, but it's always faster with a test case.

lbergelson avatar Jul 21 '23 21:07 lbergelson

Sure. If you provide your email, I can send a couple VCFs that worked/did not work along with the reference genome?

lukaskon avatar Jul 21 '23 21:07 lukaskon

Thank you. You an send them to [email protected]. If the files are more than a few MB they might not email though. I'm assuming this non-human data and therefore not subject to human privacy rules?

lbergelson avatar Jul 21 '23 21:07 lbergelson

Correct, these are fungal genomes. I just sent an email with a dropbox link, let me know if you need anything else. Thanks for looking at it!

lukaskon avatar Jul 21 '23 22:07 lukaskon

Hello, just inquiring if any solution has been found for this issue? Thank you.

lukaskon avatar Sep 06 '23 21:09 lukaskon

Hello, I seem to be seeing this same issue in trying to use FastaAlternateReferenceMaker. I was curious if any solutions have been found? Thank you.

bobthomson avatar Jun 03 '24 21:06 bobthomson