hap.py icon indicating copy to clipboard operation
hap.py copied to clipboard

PreProcessing creates un-ordered VCF

Open MattWellie opened this issue 3 years ago • 1 comments

Error message:

Exception: Command line bcftools index -t /tmp/tmpIg13Dm.vcf.gz got return code 255.STDOUT: STDERR: [E::hts_idx_push] unsorted positions on sequence #5: 38598232 followed by 38598227index: failed to create index for "/tmp/tmpIg13Dm.vcf.gz"
Error running BCFTOOLS; please check your file for compatibility issues issues using vcfcheck

This is caused by running the following command

hap.py {truth_vcf.gz} {input_vcf.bgz} -r {path_to_fasta} -R {path_to_bed} -o {path_to_output} --engine-vcfeval-path={path_to_rtg} --threads 10 --engine-vcfeval-template {path_to_sdf} --engine=vcfeval --preprocess-truth

The pre-processing of the truth VCF (with tons of logging lines as noted in #129), but fails during the processing of the input VCF. The sites in question (chr5:38598232, chr5:38598227) are correctly ordered in the query VCF before processing, and are 2 deletions. It seems like at some point during the pre-processing the sites are becoming disordered, possibly as a result of the ref alleles overlapping. Without including any additional VCF content, the 2 lines as they appear in the sorted, indexed, input VCF:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr5	38598227	rs68118711	TTTTTTATTTATTTATTTATTTTTTTTTTTTTTC	T
chr5	38598232	rs373607072;rs71963527	TATTTATTTA	T

If this is a result of decomposition, could this be resolved by adding in a bcftools sort call prior to the index?

This issue appears to be the same as #109, but a fix was proposed for that issue on a branch that no longer exists, and the issue is closed.

MattWellie avatar Aug 15 '22 04:08 MattWellie

I've solved this problem myself by updating the bcftools version within the image

The issue is that the BCFtools 1.4.1 bundled as an external tool here seems to have a dumb BED file filtering process, where each region in the BED file is processed in turn, and variants are appended to the output. Relevant regions from my BED file are:

chr5	38598232	38598244
chr5	38598259	38602497

In this example both variants overlap with the first interval, and the first (sorted by POS) ends in the second interval so it is being added twice: (example salvaged from the temp VCF generated by bcftools.py's preProcessVCF)

chr5	38598227	rs68118711	TTTTTTATTTATTTATTTATTTTTTTTTTTTTTC	T	...
chr5	38598232	rs373607072;rs71963527	TATTTATTTA	T	...
chr5	38598227	rs68118711	TTTTTTATTTATTTATTTATTTTTTTTTTTTTTC	T	...

PLEASE PLEASE PLEASE update the bundled tool binaries in this repository, the SAMtools/BCFtools are 5 years old now

MattWellie avatar Aug 16 '22 02:08 MattWellie