duphold icon indicating copy to clipboard operation
duphold copied to clipboard

No error generated but duphold output missing last few lines.

Open jjfarrell opened this issue 4 years ago • 18 comments

I am running duphold on about 5000 crams. About 120 or so are finishing but the output vcf is not complete. It ends prematurely near the end of chromosome 22. There are only a few lines left to process of the imput vcf.

The last full vcf line output varies but all near the end. 1 50796757 24 50797069 24 50797489 7 50797585 7 50797605 4 50797606 11 50798225 14 50798764 25 50798783 20 50799200

here is an example of the end of one file...

chr22   50797069        rs201907533     GA      G       .       PASS    AVGCOV=7;MINCOV=7;ALTCOV=0;ZYG=hom;COVRATIO=1;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1579delA||||||INFO_REALIGN_3_PRIME;CAF=[0.9876,0.0124];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;RS=201907533;RSPOS=51235498;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000200;WGT=1;dbSNPBuildID=137;GCF=0.254902   GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.868421:1.17857:0.825:0
chr22   5079748

Any suggestions?

jjfarrell avatar Aug 28 '19 19:08 jjfarrell

did those exit with 0? do you see: "[duphold] finished" in the stderr of those jobs? did you allocate enough time, memory for the job to finish on the nodes?

make sure you're using the latest duphold. I don't recall any changes relevant here, but it's always good to have the most recent.

brentp avatar Aug 28 '19 19:08 brentp

I am getting a [duphold] finished at the end of the run. I gave it plenty of time. The most recent version (version: 0.1.4) is being used.

Below is the job output. The tabix command runs after the duphold program finishes which gets the error seen.

This are the two command lines I am using:

duphold -f /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -v adsp5k.scalpel.genes.indel.norm.ann.vcf.gz -d -b $CRAM|bgzip -c > output/$SAMPLE.vcf.gz
tabix -p vcf output/$SAMPLE.vcf.gz

The output is:

more duphold.sh.o8967855
[duphold] finished
[E::hts_idx_push] Unsorted positions on sequence #22: 50797069 followed by 5079748
tbx_index_build failed: output/A-WCAP-WC000752-BL-COL-51868BL1.vcf.gz

duphold works fine for the 4600 other crams.

jjfarrell avatar Aug 28 '19 19:08 jjfarrell

is the error intermittent? or does it occur every time for these same cram files?

brentp avatar Aug 28 '19 20:08 brentp

I had rerun them and had the same issue. I will try increasing the memory available on the nodes and see it that helps.

jjfarrell avatar Aug 29 '19 01:08 jjfarrell

I doubt it's the memory, but I can't think of anything else. can you share the vcf+bam for one of the failing samples?

brentp avatar Aug 29 '19 01:08 brentp

and you are running all samples with -v adsp5k.scalpel.genes.indel.norm.ann.vcf.gz?

brentp avatar Aug 29 '19 01:08 brentp

Yes, the same script is being used with the same vcf containing all the variants in the 5k samples.

I can not share these files but I will see if I can create a smaller test cram and vcf that recreates the issue.

I reran one on a node with plenty of memory writing it directly to a vcf. Still an error but not exactly at the same spot but close. So as you suggested not a memory issue.

tail output/test,vcf|awk '{print NF "\t" $0}'

10      chr22   50794791        .       CT      C       .       LowAltCnt       AVGCOV=3;MINCOV=3;ALTCOV=21;ZYG=het;COVRATIO=0.12;CHI2=13.5;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-3863delT||||||;GCF=0.431373   GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.03922:1.2619:1.01923:3
10      chr22   50794884        .       AC      A       .       PASS    AVGCOV=10;MINCOV=10;ALTCOV=22;ZYG=het;COVRATIO=0.31;CHI2=4.5;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-3770delC||||||;GCF=0.558824  GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.09804:1.30233:1.16667:3
10      chr22   50795110        .       AT      A       .       LowAltCnt;HighChi2score AVGCOV=3;MINCOV=3;ALTCOV=32;ZYG=het;COVRATIO=0.09;CHI2=24.03;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-3544delT||||||;GCF=0.45098   GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.745098:0.863636:0.730769:3
10      chr22   50795909        .       AT      A       .       PASS    AVGCOV=17;MINCOV=17;ALTCOV=21;ZYG=het;COVRATIO=0.45;CHI2=0.42;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2741delT||||||INFO_REALIGN_3_PRIME;GCF=0.254902     GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.01961:1.33333:0.83871:0
10      chr22   50795925        .       AAAGT   A       .       PASS    AVGCOV=19;MINCOV=19;ALTCOV=27;ZYG=het;COVRATIO=0.41;CHI2=1.39;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2726_333-2723delTAAG||||||INFO_REALIGN_3_PRIME;GCF=0.247619 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.01961:1.33333:0.83871:0
10      chr22   50796118        .       A       ACATACTAT       .       PASS    AVGCOV=20;MINCOV=20;ALTCOV=22;ZYG=het;COVRATIO=0.48;CHI2=0.1;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=ACATACTAT|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2534_333-2527dupTACTATCA||||||INFO_REALIGN_3_PRIME;GCF=0.336634      GT:AD:DP:DHFC:DHFFC:DHBFC       0/0:.:.:0.843137:1.16216:0.741379
10      chr22   50796247        .       GA      G       .       PASS    AVGCOV=10;MINCOV=10;ALTCOV=13;ZYG=het;COVRATIO=0.43;CHI2=0.39;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2406delA||||||INFO_REALIGN_3_PRIME;GCF=0.470588     GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:1.01961:1.44444:1:0
10      chr22   50796312        .       GT      G       .       PASS    AVGCOV=3;MINCOV=3;ALTCOV=11;ZYG=het;COVRATIO=0.21;CHI2=4.57;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2342delT||||||;GCF=0.460784   GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.607843:0.837838:0.596154:0
10      chr22   50796312        .       GTC     G       .       PASS    AVGCOV=6;MINCOV=6;ALTCOV=16;ZYG=het;COVRATIO=0.27;CHI2=4.55;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-2342_333-2341delTC||||||;GCF=0.456311 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.607843:0.837838:0.596154:0
8       chr22   50796312        .       GTCCC   G       .       LowAltCnt       AVGCOV=4;MINCOV=4;ALTCOV=11;ZYG=het;COVRATIO=0.27;CHI2=3.27;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00

jjfarrell avatar Aug 29 '19 03:08 jjfarrell

instead of piping to bgzip, can you instead just use -o $sample.vcf.gz? that will be faster anyway and rule out piping issues.

brentp avatar Aug 29 '19 03:08 brentp

That fixed it on the test sample! Must be some issue with closing the pipe. Thanks for pin pointing the issue so quickly.

I check a few of the samples that were not getting an error with tabix. Not all were getting the last few variants. They probably just happened to end with an EOL. So I will re-run it on all the samples.

tabix output/ADNI_002_S_4229.vcf.gz chr22:50799200|tail|awk '{print NF "\t" $0}'

10      chr22   50799200        .       CAACTT  C       .       PASS    AVGCOV=9;MINCOV=9;ALTCOV=9;ZYG=het;COVRATIO=0.5;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.881_885delCTTAA||||||INFO_REALIGN_3_PRIME;GCF=0.216981     GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.196078:0.277778:0.16129:1
10      chr22   50799573        .       CAT     C       .       LowAltCnt       AVGCOV=3;MINCOV=3;ALTCOV=15;ZYG=het;COVRATIO=0.17;CHI2=8;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.1253_1254delTA||||||INFO_REALIGN_3_PRIME;GCF=0.291262    GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.627451:0.842105:0.52459:0
[farrell@scc-hadoop duphold]$  zcat  output/ADNI_002_S_4229.vcf.gz|tail|awk '{print NF "\t" $0}'
10      chr22   50797069        rs201907533     GA      G       .       PASS    AVGCOV=7;MINCOV=7;ALTCOV=0;ZYG=hom;COVRATIO=1;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1579delA||||||INFO_REALIGN_3_PRIME;CAF=[0.9876,0.0124];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;RS=201907533;RSPOS=51235498;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000200;WGT=1;dbSNPBuildID=137;GCF=0.254902      GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP   0/0:.:.:0.431373:0.814815:0.354839:0
10      chr22   50797489        .       CAG     C       .       PASS    AVGCOV=8;MINCOV=8;ALTCOV=16;ZYG=het;COVRATIO=0.33;CHI2=2.67;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1165_333-1164delAG||||||;GCF=0.378641 GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.529412:1.17391:0.490909:2
10      chr22   50797585        rs200507571     A       AT      .       PASS    AVGCOV=12;MINCOV=12;ALTCOV=15;ZYG=het;COVRATIO=0.44;CHI2=0.33;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=AT|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1064dupT||||||INFO_REALIGN_3_PRIME;CAF=[0.8962,0.1038];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;NOC;RS=200507571;RSPOS=51236013;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000210;WGT=1;dbSNPBuildID=137;GCF=0.306931        GT:AD:DP:DHFC:DHFFC:DHBFC        0/0:.:.:0.627451:1.3913:0.52459
10      chr22   50797605        .       C       CATTT   .       LowAltCnt;HighChi2score AVGCOV=3;MINCOV=3;ALTCOV=83;ZYG=het;COVRATIO=0.03;CHI2=74.42;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=CATTT|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1032_333-1029dupTTTA||||||INFO_REALIGN_3_PRIME;GCF=0.366337      GT:AD:DP:DHFC:DHFFC:DHBFC       0/0:.:.:0.627451:1.3913:0.551724
10      chr22   50797606        .       ATTTAT  A       .       PASS    AVGCOV=16;MINCOV=16;ALTCOV=20;ZYG=het;COVRATIO=0.44;CHI2=0.44;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1046_333-1042delTATTT||||||INFO_REALIGN_3_PRIME;GCF=0.377358        GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.627451:1.3913:0.581818:2
10      chr22   50798225        .       ATACT   A       .       PASS    AVGCOV=8;MINCOV=8;ALTCOV=8;ZYG=het;COVRATIO=0.5;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-427_333-424delCTTA||||||INFO_REALIGN_3_PRIME;GCF=0.2       GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.176471:0.36:0.147541:0
10      chr22   50798764        .       AAAG    A       .       PASS    AVGCOV=5;MINCOV=5;ALTCOV=22;ZYG=het;COVRATIO=0.19;CHI2=10.7;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=A|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.447_449delAGA||||||INFO_REALIGN_3_PRIME;GCF=0.538462  GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.607843:1.34783:0.645833:0
10      chr22   50798783        .       CT      C       .       PASS    AVGCOV=17;MINCOV=17;ALTCOV=22;ZYG=het;COVRATIO=0.44;CHI2=0.64;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.462delT||||||;GCF=0.568627  GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.607843:1.34783:0.645833:1
10      chr22   50799200        .       CAACTT  C       .       PASS    AVGCOV=9;MINCOV=9;ALTCOV=9;ZYG=het;COVRATIO=0.5;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=C|non_coding_transcript_exon_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|3/3|n.881_885delCTTAA||||||INFO_REALIGN_3_PRIME;GCF=0.216981     GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.196078:0.277778:0.16129:1
10      chr22   50799573        .       CAT     C       .       LowAltCnt       AVGCOV=3;MINCOV=3;ALTCOV=15;ZYG=het;COVRATIO=0.17;CHI2=8;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;A

jjfarrell avatar Aug 29 '19 15:08 jjfarrell

I still don't know why/how this happened, but, if you weren't already, always run any script with pipes with the set -o pipefail option in bash to make sure you catch the actual error.

brentp avatar Aug 29 '19 15:08 brentp

There is no error triggered with that option.

I took a look at some of the code. I think it can be traced to this code.

https://github.com/brentp/hts-nim/blob/master/src/hts/vcf.nim

proc close*(v:VCF) =
  if v.fname != "-" and v.fname != "/dev/stdin" and v.hts != nil:
    if hts_close(v.hts) != 0:
        when defined(debug):
            stderr.write_line "[hts-nim] error closing vcf"
    v.hts = nil

Probably just needs a flushFile(stdout) if v.fname='-'. .

jjfarrell avatar Aug 29 '19 23:08 jjfarrell

yes. that's it. nicely spotted! I think it should just hts_close("/dev/stdin"). I'll get this in the next release but you have a work-around for now. thanks for finding the solution.

brentp avatar Aug 29 '19 23:08 brentp

here's a binary with that fixed. duphold.gz

brentp avatar Aug 30 '19 00:08 brentp

this is fixed in latest release. thanks for reporting and testing.

brentp avatar Sep 11 '19 02:09 brentp

I see a similar/same issue with release 0.2.1; changing from piping to bgzip to using -o sample.vcf.gz fixed it for me as well. Happy to help debug this.

cbrueffer avatar May 27 '20 10:05 cbrueffer

I don't see a way to fix this as it is explicitly closing the file and flushing in case of stdout. I did try adding an additional flush if you want to try the attached binary.

duphold_dev.gz

if this doesn't work, i'll probably add a check to ensure that stdout is not used.

brentp avatar May 27 '20 14:05 brentp

Thanks for the reply; I'm seeing the same problem with the binary you posted. What happens is that in the pipe case, the last VCF record is cut off. Original and dev binary produce the same VCF in the pipe case and the -o case respectively.

Last record with -o:

chrUn_GL000218v1	54466	gridss314fb_2462h	G	]chrUn_GL000218v1:54465]GTGCACACGTGTGTG	120.52	LOW_QUAL;NO_ASSEMBLY	AS=0;ASC=1X;ASQ=0;ASRP=0;ASSR=0;BA=0;BANRP=0;BANRPQ=0;BANSR=0;BANSRQ=0;BAQ=0;BASRP=0;BASSR=0;BENAMES=NDX550213:6:HHHCCBGXF:1:21305:11427:10651;BMQ=47;BMQN=47;BMQX=47;BPNAMES=NDX550213:6:HHHCCBGXF:1:11201:6519:18766,NDX550213:6:HHHCCBGXF:2:12112:4327:15674,NDX550213:6:HHHCCBGXF:3:12407:2414:6309,NDX550213:6:HHHCCBGXF:4:23403:21325:6847;BQ=33.65;BSC=2;BSCQ=33.65;BUM=0;BUMQ=0;BVF=1;CAS=0;CASQ=0;CQ=120.52;EVENT=gridss314fb_2462;IC=4;IHOMPOS=0,0;IQ=120.52;MATEID=gridss314fb_2462o;MQ=32.25;MQN=24;MQX=40;RAS=0;RASQ=0;REF=19;REFPAIR=2;RP=0;RPQ=0;SB=0.666667;SC=1X111M;SR=0;SRQ=0;SVTYPE=BND;VF=4;GCF=0.39604	GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF:DHFC:DHFFC:DHBFC	.:0:0:0:0:0:0:0:0:0:0:33.65:2:33.65:0:0:1:0:4:120.52:120.52:0:19:2:0:0:0:0:4:1.11765:1.58333:1

Piped to bgzip:

chrUn_GL000218v1	54466	gridss314fb_2462h	G	]chrUn_GL000218v1:54465]GTGCACACGTGTGTG	120.52	LOW_QUAL;NO_ASSEMBLY	AS=0;ASC=1X;ASQ=0;ASRP=0;ASSR=0;BA=0;BANRP=0;BANRPQ=0;BANSR=0;BANSRQ=0;BAQ=0;BASRP=0;BASSR=0;BENAMES=NDX550213:6:HHHCCBGXF:1:21305:11427:10651;BMQ=47;BMQN=47;BMQX=47;BPNAMES=NDX550213:6:HHHCCBGXF:1:11201:6519:18766,NDX550213:6:HHHCCBGXF:2:12112:4327:15674,NDX550213:6:HHHCCBGXF:3:12407:2414:6309,NDX550213:6:HHHCCBGXF:4:23403:21325:6847;BQ=33.65;BSC=2;BSCQ=33.65;BUM=0;BUMQ=0;BVF=1;CAS=0;CASQ=0;CQ=120.52;EVENT=gridss314fb_2462;IC=4;IHOMPOS=0,0;IQ=120.52;MATEID=gridss314fb_2462o;MQ=32.25;MQN=24;MQX=40;RAS=0;RASQ=0;REF=19;REFPAIR=2;RP=0;RPQ=0;SB=0.666667;SC=1X111M;SR=0;SRQ=0;SVTYPE=BND;VF=4;GCF=0.39604	GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RA

Piping into bgzip in the first place was a documentation issue for me; I didn't realize duphold could compress files directly before finding this PR, so perhaps this is what could be improved instead.

cbrueffer avatar May 27 '20 17:05 cbrueffer

I am happy to accept a PR to improve docs.

brentp avatar May 27 '20 20:05 brentp