hap.py

Docker

Open ktadler opened this issue 6 years ago • 1 comments

Hi Peter, I have a couple of questions, please.

  1. I tried to define the reference location both by passing the -r parameter and by exporting the HGREF variable. The resulting table is empty for the TRUTH columns, which I guess corresponds to the non-reference counts at the beginning of the run. What am I doing wrong?

    [exomeuser@labsrv4 TwistExomeRefSeq_NA12878HG001_S1_4]$ export HGREF=/nadata/data/exomeseq/workspace/TwistExomeRefSeq_NA12878HG001_S1_4/compare_to_ref/hg19.fa
    [exomeuser@labsrv4 TwistExomeRefSeq_NA12878HG001_S1_4]$ cd compare_to_ref/
    [exomeuser@labsrv4 compare_to_ref]$ docker run -it -v pwd:/data pkrusche/hap.py /opt/hap.py/bin/hap.py /data/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf.vcf /data/combined_TwistExomeRefSeq_NA12878HG001_S1_4.recal.all_edit.vcf -f /data/intersect_beds.bed -r /data/hg19.fa -o /data/test
    2019-05-30 10:29:42,919 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
    2019-05-30 10:29:42,948 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
    [I] Total VCF records: 3775119
    [I] Non-reference VCF records: 3775119
    [I] Total VCF records: 37554
    [I] Non-reference VCF records: 34867


Benchmarking Summary:

| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INDEL | ALL | 0 | 0 | 0 | 2713 | 0 | 2713 | 0 | 0 | NaN | 1 | NaN | NaN | NaN | NaN | 2.424936 |
| INDEL | PASS | 0 | 0 | 0 | 2713 | 0 | 2713 | 0 | 0 | NaN | 1 | NaN | NaN | NaN | NaN | 2.424936 |
| SNP | ALL | 0 | 0 | 0 | 32184 | 0 | 32184 | 0 | 0 | NaN | 1 | NaN | NaN | 2.599955 | NaN | 1.832658 |
| SNP | PASS | 0 | 0 | 0 | 32184 | 0 | 32184 | 0 | 0 | NaN | 1 | NaN | NaN | 2.599955 | NaN | 1.832658 |


  2. As I understand it, the Docker image doesn't consider complex variants, meaning that if I want to use rtg-tools I need a real installation?

  3. The documentation says the software can be installed on CentOS 7, but it also says I need gcc/g++ 4.9.x, while the most recent RPM for CentOS is 4.8.5. Am I missing something?

Thank you, Katie

ktadler, May 30 '19 11:05

Hi Katie,

  1. The environment variable HGREF is set outside the Docker container, to a location that is not accessible from within it. The location needs to be given relative to a path mounted inside the container (here, /data). In your example above you can ignore the HGREF-related warning, since you are specifying the reference location explicitly as /data/hg19.fa (which should work, judging by your other commands). HGREF is only used as a default if no reference location is specified.

A reason for the zero counts could be that the chromosome names are not consistent. GIAB typically uses numerical names (1, 2, 3, ...), while hg19 uses chr-prefixed names (chr1, chr2, ...). If all data is GRCh37/hg19, the simplest fix would be to strip the chr prefix from the query file.
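As a rough illustration of the chr-prefix fix described above, here is a minimal Python sketch that works on plain-text (uncompressed) VCF lines. The helper names `chrom_names` and `strip_chr_prefix` are made up for this example and are not part of hap.py. It only removes a leading `chr`, so names that differ in other ways (for example chrM versus MT) would still need separate handling.

```python
import re

# Matches the start of a VCF contig header such as "##contig=<ID=chr1,length=...>".
_CONTIG_ID = re.compile(r"^(##contig=<ID=)chr")


def chrom_names(lines):
    """Return the set of CHROM values used by the data lines of a VCF."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def strip_chr_prefix(lines):
    """Yield VCF lines with a leading 'chr' removed from contig names."""
    for line in lines:
        if line.startswith("##"):
            # Meta-information line: only contig IDs are rewritten.
            yield _CONTIG_ID.sub(r"\1", line)
        elif line.startswith("#"):
            # Column header line (#CHROM ...) is passed through unchanged.
            yield line
        else:
            # Data line: CHROM is the first tab-separated column.
            yield line[3:] if line.startswith("chr") else line


# Tiny in-memory example; real files would be read and written line by line.
query = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=chr1,length=249250621>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
]
fixed = list(strip_chr_prefix(query))
print(chrom_names(query), "->", chrom_names(fixed))
```

Comparing `chrom_names` for the truth and query files before running hap.py is a quick way to confirm whether a naming mismatch is the cause. It is probably worth checking that the regions BED file and the reference FASTA use the same naming as well; that is an assumption here, not something confirmed in this thread.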

  2. The Docker image should contain vcfeval; also, hap.py can handle complex variants to some degree without vcfeval. To run with vcfeval, specify --engine=vcfeval.

  3. To get more recent C++ compilers on older CentOS, you can use the devtoolset versions: https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/ ; there is also a fix available that should (theoretically) make it work with 4.8.x: https://github.com/Illumina/hap.py/pull/82
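To tie points 1 and 2 together, the following Python sketch assembles a `docker run` command in which every path, including HGREF, refers to the /data mount inside the container, with the vcfeval engine switched on optionally. The function name `happy_docker_command` and its parameters are hypothetical; the image name, the hap.py path, and the -f/-r/-o/--engine options are taken from this thread, while `-v` and `-e` are standard `docker run` options. Whether setting HGREF this way silences the warning has not been verified here.

```python
import shlex
from pathlib import Path


def happy_docker_command(host_dir, truth, query, bed, ref, out_prefix, engine=None):
    """Build a docker argv list; file names are relative to host_dir."""
    mount = "/data"  # mount point inside the container, as in the thread
    cmd = [
        "docker", "run", "-it",
        # Mount the host directory by absolute path.
        "-v", f"{Path(host_dir).resolve()}:{mount}",
        # Set HGREF inside the container so it points at the mounted file.
        "-e", f"HGREF={mount}/{ref}",
        "pkrusche/hap.py", "/opt/hap.py/bin/hap.py",
        f"{mount}/{truth}", f"{mount}/{query}",
        "-f", f"{mount}/{bed}",
        "-r", f"{mount}/{ref}",
        "-o", f"{mount}/{out_prefix}",
    ]
    if engine:
        cmd.append(f"--engine={engine}")
    return cmd


# Placeholder file names, for illustration only.
cmd = happy_docker_command(
    ".", "truth.vcf", "query.vcf", "regions.bed", "hg19.fa", "test",
    engine="vcfeval",
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with long file names.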

Hope this helps! Peter

pkrusche, Jun 03 '19 16:06