hap.py icon indicating copy to clipboard operation
hap.py copied to clipboard

segmentation fault at gvcf2bed

Open meredith705 opened this issue 2 years ago • 7 comments

Hello,

I'm running the hap.py v0.3.12 docker using the Genome In A Bottle HG002 vs GRCh38 truth set using the files below and getting a segmentation fault error in the gvcf2bed program.

The GIAB bottle truth set is here https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/

The vcf I'm using as truth is: HG002_GRCh38_1_22_v4.2.1_benchmark_phased_MHCassembly_StrandSeqANDTrio.vcf from their supplementary files here

High confidence bed file: HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed file from the same directory as above

The GRCh38 reference is: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz

The query vcf I'm using is created by running dipcall on two haploid assemblies of chr11 ( a small test assembly ) using the same GRCh38 assembly as a reference: paternal.dip.vcf.gz

Here is my command:

sudo docker run -it -u `id -u $USER`:`id -g $USER` -v `pwd`:/data jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/data/hg38/GIAB_HG002_GRCh38_1_22_v4.2.1_phased.vcf \
/data/wdl_scripts/paternal.dip.vcf.gz \
-f /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz \
-o /data/happy/chr11 \
--pass-only --no-roc --no-json --engine=vcfeval \
--threads=10

I get a segmentation fault error after about 2 minutes of run time.

2022-03-28 18:17:44,433 WARNING  No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
Hap.py 
[W] overlapping records at chr6:29747433 for sample 0
[W] Variants that overlap on the reference allele: 5
[I] Total VCF records:         1759115
[I] Non-reference VCF records: 1759115
Segmentation fault
2022-03-28 18:19:58,760 ERROR    Command 'gvcf2bed /tmp/truth.ppMZDdGL.vcf.gz -r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz -o /tmp/tmpdH4L04.bed -T /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed' returned non-zero exit status 139
2022-03-28 18:19:58,763 ERROR    Traceback (most recent call last):
2022-03-28 18:19:58,763 ERROR      File "/opt/hap.py/bin/hap.py", line 529, in <module>
2022-03-28 18:19:58,764 ERROR        main()
2022-03-28 18:19:58,764 ERROR      File "/opt/hap.py/bin/hap.py", line 314, in main
2022-03-28 18:19:58,765 ERROR        conf_temp = Haplo.gvcf2bed.gvcf2bed(args.vcf1, args.ref, args.fp_bedfile, args.scratch_prefix)
2022-03-28 18:19:58,765 ERROR      File "/opt/hap.py/lib/python27/Haplo/gvcf2bed.py", line 39, in gvcf2bed
2022-03-28 18:19:58,765 ERROR        subprocess.check_call(cmdline, shell=True)
2022-03-28 18:19:58,765 ERROR      File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
2022-03-28 18:19:58,776 ERROR        raise CalledProcessError(retcode, cmd)
2022-03-28 18:19:58,776 ERROR    CalledProcessError: Command 'gvcf2bed /tmp/truth.ppMZDdGL.vcf.gz -r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz -o /tmp/tmpdH4L04.bed -T /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed' returned non-zero exit status 139

I really appreciate your help figuring out this issue, please let me know what other information I can provide.

Thank you in advance, Melissa

meredith705 avatar Mar 28 '22 18:03 meredith705

Using fasta file from alternative sources will help you get rid of this.

YuanfengZhang avatar Apr 27 '22 02:04 YuanfengZhang

I am also getting this error. But what is it about the fasta file thats causing this error? Any more information that can help debug this?

Thanks.

KatharineME avatar Apr 29 '22 16:04 KatharineME

Did you figure this out @meredith705? Looks like we are doing something very similar.

KatharineME avatar Apr 29 '22 16:04 KatharineME

@KatharineME @YuanfengZhang I have tried using a different version of GRCh38 and have still gotten the same error.

The other version is https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz

meredith705 avatar May 24 '22 17:05 meredith705

@KatharineME @YuanfengZhang I have tried using a different version of GRCh38 and have still gotten the same error.

The other version is https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz

(1) Try GDC/BWA/GATK/GA4GH/other sources. (2) Someone knowing cpp is need to handle this.

YuanfengZhang avatar May 25 '22 01:05 YuanfengZhang

Hi All, I got hap.py working without changing the reference genome. I did the following. I'm assuming that you are using rtg as the vcf comparison engine:

  1. Create genome sdf with rtg
rtg format -o path_to_sdf path_to_decompressed_reference_genome`
  1. Get the pkrusche/hap.py docker container with latest tag

  2. Run the hap.py docker container like so

docker run --interactive --detach --tty --user root -v local_dir_with_all_needed_files:dir_in_container pkrusche/hap.py bash
  1. Calling hap.py like so. Notice that the reference genome passed is decompressed. Remember that all paths in the command are paths in the container.
docker exec --interactive docker_container_id bash -c "/opt/hap.py/bin/hap.py path_to_truth_vcf path_to_query_vcf -f path_to_confident_regions_bed -r path_to_decompressed_reference_genome -o output_dir --engine-vcfeval-path path_to_rtg_folder --engine-vcfeval-template path_to_genome_sdf" 

If this doesn't work for you, I recommend trying to run the examples from the readme in the docker container.

KatharineME avatar May 26 '22 17:05 KatharineME

I had the same issue. Decompressing the reference fastq.gz and reindexing it with samtools faidx solved it for me. My other steps are here.

barslmn avatar Dec 30 '22 15:12 barslmn