hap.py
hap.py copied to clipboard
segmentation fault at gvcf2bed
Hello,
I'm running the hap.py v0.3.12 docker using the Genome In A Bottle HG002 vs GRCh38 truth set using the files below and getting a segmentation fault error in the gvcf2bed program.
The GIAB bottle truth set is here https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/
The vcf I'm using as truth is:
HG002_GRCh38_1_22_v4.2.1_benchmark_phased_MHCassembly_StrandSeqANDTrio.vcf
from their supplementary files here
High confidence bed file:
HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
file from the same directory as above
The GRCh38 reference is: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz
The query vcf I'm using is created by running dipcall on two haploid assemblies of chr11 ( a small test assembly ) using the same GRCh38 assembly as a reference: paternal.dip.vcf.gz
Here is my command:
sudo docker run -it -u `id -u $USER`:`id -g $USER` -v `pwd`:/data jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/data/hg38/GIAB_HG002_GRCh38_1_22_v4.2.1_phased.vcf \
/data/wdl_scripts/paternal.dip.vcf.gz \
-f /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz \
-o /data/happy/chr11 \
--pass-only --no-roc --no-json --engine=vcfeval \
--threads=10
I get a segmentation fault error after about 2 minutes of run time.
2022-03-28 18:17:44,433 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
Hap.py
[W] overlapping records at chr6:29747433 for sample 0
[W] Variants that overlap on the reference allele: 5
[I] Total VCF records: 1759115
[I] Non-reference VCF records: 1759115
Segmentation fault
2022-03-28 18:19:58,760 ERROR Command 'gvcf2bed /tmp/truth.ppMZDdGL.vcf.gz -r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz -o /tmp/tmpdH4L04.bed -T /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed' returned non-zero exit status 139
2022-03-28 18:19:58,763 ERROR Traceback (most recent call last):
2022-03-28 18:19:58,763 ERROR File "/opt/hap.py/bin/hap.py", line 529, in <module>
2022-03-28 18:19:58,764 ERROR main()
2022-03-28 18:19:58,764 ERROR File "/opt/hap.py/bin/hap.py", line 314, in main
2022-03-28 18:19:58,765 ERROR conf_temp = Haplo.gvcf2bed.gvcf2bed(args.vcf1, args.ref, args.fp_bedfile, args.scratch_prefix)
2022-03-28 18:19:58,765 ERROR File "/opt/hap.py/lib/python27/Haplo/gvcf2bed.py", line 39, in gvcf2bed
2022-03-28 18:19:58,765 ERROR subprocess.check_call(cmdline, shell=True)
2022-03-28 18:19:58,765 ERROR File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
2022-03-28 18:19:58,776 ERROR raise CalledProcessError(retcode, cmd)
2022-03-28 18:19:58,776 ERROR CalledProcessError: Command 'gvcf2bed /tmp/truth.ppMZDdGL.vcf.gz -r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz -o /tmp/tmpdH4L04.bed -T /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed' returned non-zero exit status 139
I really appreciate your help figuring out this issue, please let me know what other information I can provide.
Thank you in advance, Melissa
Using fasta file from alternative sources will help you get rid of this.
I am also getting this error. But what is it about the fasta file thats causing this error? Any more information that can help debug this?
Thanks.
Did you figure this out @meredith705? Looks like we are doing something very similar.
@KatharineME @YuanfengZhang I have tried using a different version of GRCh38 and have still gotten the same error.
The other version is https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz
@KatharineME @YuanfengZhang I have tried using a different version of GRCh38 and have still gotten the same error.
The other version is https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz
(1) Try GDC/BWA/GATK/GA4GH/other sources. (2) Someone knowing cpp is need to handle this.
Hi All, I got hap.py working without changing the reference genome. I did the following. I'm assuming that you are using rtg as the vcf comparison engine:
- Create genome sdf with rtg
rtg format -o path_to_sdf path_to_decompressed_reference_genome`
-
Get the pkrusche/hap.py docker container with
latest
tag -
Run the hap.py docker container like so
docker run --interactive --detach --tty --user root -v local_dir_with_all_needed_files:dir_in_container pkrusche/hap.py bash
- Calling hap.py like so. Notice that the reference genome passed is decompressed. Remember that all paths in the command are paths in the container.
docker exec --interactive docker_container_id bash -c "/opt/hap.py/bin/hap.py path_to_truth_vcf path_to_query_vcf -f path_to_confident_regions_bed -r path_to_decompressed_reference_genome -o output_dir --engine-vcfeval-path path_to_rtg_folder --engine-vcfeval-template path_to_genome_sdf"
If this doesn't work for you, I recommend trying to run the examples from the readme in the docker container.
I had the same issue. Decompressing the reference fastq.gz and reindexing it with samtools faidx solved it for me. My other steps are here.