hap.py
hap.py copied to clipboard
pre.py errors when VCF includes HLA contigs from hg38
Hello,
I noticed pre.py errors out on the HLA contigs when I use a VCF called against all contigs in the hg38 assembly. I have been able to successfully run pre.py on an input VCF called against hg38 by excluding all the HLA records. This is the error message I get when keeping the HLA records in the VCF:
2020-08-19 12:07:25,264 ERROR Exception when running <function blocksplitWrapper at 0x2b44bbdc7488>:
2020-08-19 12:07:25,264 ERROR ------------------------------------------------------------
2020-08-19 12:07:25,264 ERROR Traceback (most recent call last):
2020-08-19 12:07:25,264 ERROR File "/usr/apps/bio/hap.py/hap.py-0.3.12/lib/python27/Tools/parallel.py", line 72, in parMapper
2020-08-19 12:07:25,265 ERROR return arg[1]['fun'](arg[0], *arg[1]['args'], **arg[1]['kwargs'])
2020-08-19 12:07:25,265 ERROR File "/usr/apps/bio/hap.py/hap.py-0.3.12/lib/python27/Haplo/partialcredit.py", line 121, in blocksplitWrapper
2020-08-19 12:07:25,265 ERROR subprocess.check_call(to_run, shell=True, stdout=tfo, stderr=tfe)
2020-08-19 12:07:25,265 ERROR File "/usr/apps/bio/python/Python-2.7.14/Lib/subprocess.py", line 186, in check_call
2020-08-19 12:07:25,275 ERROR raise CalledProcessError(retcode, cmd)
2020-08-19 12:07:25,275 ERROR CalledProcessError: Command 'blocksplit /tmp/tmpGhIe52.vcf.gz -l 'HLA-DQA1*01:01:02' -o /tmp/result.HLA-DQA1*01:01:02tUT4V6.chunks.bed --window 10000 --nblocks 40 -f 0' returned non-zero exit status 1
2020-08-19 12:07:25,275 ERROR ------------------------------------------------------------
2020-08-19 12:07:25,326 WARNING stoll
2020-08-19 12:07:25,326 ERROR Exception when running <function blocksplitWrapper at 0x2b44bbdc7488>:
2020-08-19 12:07:25,326 ERROR ------------------------------------------------------------
2020-08-19 12:07:25,327 ERROR Traceback (most recent call last):
2020-08-19 12:07:25,327 ERROR File "/usr/apps/bio/hap.py/hap.py-0.3.12/lib/python27/Tools/parallel.py", line 72, in parMapper
2020-08-19 12:07:25,327 ERROR return arg[1]['fun'](arg[0], *arg[1]['args'], **arg[1]['kwargs'])
2020-08-19 12:07:25,327 ERROR File "/usr/apps/bio/hap.py/hap.py-0.3.12/lib/python27/Haplo/partialcredit.py", line 121, in blocksplitWrapper
2020-08-19 12:07:25,327 ERROR subprocess.check_call(to_run, shell=True, stdout=tfo, stderr=tfe)
2020-08-19 12:07:25,327 ERROR File "/usr/apps/bio/python/Python-2.7.14/Lib/subprocess.py", line 186, in check_call
2020-08-19 12:07:25,327 ERROR raise CalledProcessError(retcode, cmd)
2020-08-19 12:07:25,327 ERROR CalledProcessError: Command 'blocksplit /tmp/tmpGhIe52.vcf.gz -l 'HLA-DRB1*09:21' -o /tmp/result.HLA-DRB1*09:21gYV5y0.chunks.bed --window 10000 --nblocks 40 -f 0' returned non-zero exit status 1
2020-08-19 12:07:25,327 ERROR ------------------------------------------------------------
Traceback (most recent call last):
File "/usr/apps/bio/hap.py/hap.py-0.3.12/bin/pre.py", line 408, in <module>
main()
File "/usr/apps/bio/hap.py/hap.py-0.3.12/bin/pre.py", line 404, in main
preprocessWrapper(args)
File "/usr/apps/bio/hap.py/hap.py-0.3.12/bin/pre.py", line 245, in preprocessWrapper
convert_gvcf_to_vcf=False)
File "/usr/apps/bio/hap.py/hap.py-0.3.12/bin/pre.py", line 206, in preprocess
haploid_x=gender == "male")
File "/usr/apps/bio/hap.py/hap.py-0.3.12/lib/python27/Haplo/partialcredit.py", line 192, in partialCredit
raise Exception("One of the blocksplit processes failed.")
Exception: One of the blocksplit processes failed.
Is pre.py not supported to handle these HLA contigs? It does work with all of the alt/unknown chromomes included in hg38, it only errors out when I keep the HLA variants in the VCF. I've checked the reference I was using for pre.py, and it includes all of the same contigs as the the VCF, since it was the one that I made variant calls with.
My pre.py command was:
export HGREF=Homo_sapiens_assembly38.fasta
pre.py NA1278_full.vcf out.vcf.gz
Thanks for the help, any insight is greatly appreciated!
I think that this is the same as #128 and part of #121. The underlying problem is that this code can't work: https://github.com/Illumina/hap.py/blob/79de3c780cd5aa5a7cb7ab93cbd65643fafbeddf/src/c%2B%2B/include/helpers/StringUtil.hh#L179-L207
It is meant to parse chr:start-end style ranges, but it splits on : and - in one operation, which can only work if the contig name doesn't contain a -. If the contig name does contain a -, it tries to parse the part of the contig name after the - as the range start, which mostly fails (unless it happens to actually be a number, in which case it will succeed and give a wrong answer!). Then std::stoll() throws an exception with the incredibly informative what() message of "stoll", which is passed along through the exception handling code in blocksplit and the subprocess management Python code as a WARNING log message, without further comment.
That code needs to be adapted to not depend on - not appearing in the contig name. The workaround is to rename all the contigs in your reference and VCFs to avoid the - character.