vcfpy
Pysam UnicodeDecodeError when loading with tabixed VCF
- vcfpy version: 0.13.2
- Python version: 3.6.9 64bit [GCC 8.4.0]
- Operating System: Linux 4.15.0 1093 oem x86_64 with Ubuntu 18.04 bionic
Description
When I fetch variants by contig ID, I get the following UnicodeDecodeError, which suggests a problem when parsing the tabix-indexed file. The issue may come from pysam, but I would like to know whether you have had previous reports of it.
What I Did
- Tabix VCF file
$ tabix -p vcf <vcf_filepath>
import vcfpy

reader = vcfpy.Reader.from_path(path=DATA['annot_vcf'], tabix_path=DATA['annot_vcf']+'.tbi')
for record in reader.fetch('chr1'):
    [...]
Traceback Error
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-38-046818a3e579> in <module>
3 variant_records = []
4 sample_records = []
----> 5 for record in reader.fetch('chr1'):
6 if record.CHROM in wanted_chroms:
7 ALT = record.ALT[0].value
/usr/local/lib/python3.6/dist-packages/vcfpy/reader.py in __next__(self)
171 """
172 if self.tabix_iter:
--> 173 return self.parser.parse_line(str(next(self.tabix_iter)))
174 else:
175 result = self.parser.parse_next_record()
pysam/libctabix.pyx in pysam.libctabix.TabixIterator.__next__()
pysam/libcutils.pyx in pysam.libcutils.charptr_to_str()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2821: ordinal not in range(128)
Interesting, what is your locale setting? C? What happens if you set export LC_ALL=en_US.UTF-8 or similar?
Hi @holtgrewe!
My locale settings are already set to en_US.UTF-8:
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
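The shell output can be cross-checked from inside the same Python session that raises the error, since a notebook kernel may not inherit the shell's environment. This is only a minimal sketch using the standard library; the helper name encoding_report is made up for illustration and is not part of vcfpy or pysam.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings the running interpreter actually uses.

    Hypothetical helper for illustration; not part of vcfpy or pysam.
    """
    return {
        # Encoding derived from the locale (LANG / LC_ALL / LC_CTYPE).
        "preferred": locale.getpreferredencoding(False),
        # Default str <-> bytes encoding of the interpreter.
        "default": sys.getdefaultencoding(),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(name, value)
```

If every value reports UTF-8 and the error still mentions the 'ascii' codec, the locale is probably not what decides how the tabix lines are decoded.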
Any other ideas about the issue, @holtgrewe? Thanks in advance!
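It looks like you have non-ASCII characters in your VCF file and pysam is stumbling over them: the traceback reports the 'ascii' codec even though your locale is UTF-8, and byte 0xc3 is a typical lead byte of a two-byte UTF-8 sequence (for example an accented letter).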
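To check whether the file really contains non-ASCII bytes, and where, one could scan it directly without going through pysam. The following is a minimal sketch, assuming the VCF was compressed with bgzip (BGZF is a multi-member gzip stream, which Python's gzip module can read). The function name find_non_ascii and its limit parameter are hypothetical, chosen only for this example.

```python
import gzip


def find_non_ascii(path, limit=10):
    """Return up to `limit` tuples (line_number, byte_offset, byte_value),
    one per line that contains a byte outside the ASCII range.

    Hypothetical helper for illustration; not part of vcfpy or pysam.
    """
    hits = []
    # Read raw bytes so that no decoding (and no decoding error) happens here.
    with gzip.open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            for offset, byte in enumerate(line):
                if byte > 0x7F:
                    # Record only the first offending byte of each line.
                    hits.append((line_number, offset, byte))
                    break
            if len(hits) >= limit:
                break
    return hits
```

Calling find_non_ascii on the compressed VCF path and printing the reported lines should show which header or INFO field carries the offending characters; these could then be replaced or escaped before re-indexing with tabix.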
Hm, I don't remember why I chose pysam over pytabix, and I don't know whether pytabix is more robust. One could try replacing the tabix part of pysam with pytabix in vcfpy...