
Pysam UnicodeDecodeError when loading with tabixed VCF

Open dgomezpere opened this issue 4 years ago • 5 comments

  • vcfpy version: 0.13.2
  • Python version: 3.6.9 64bit [GCC 8.4.0]
  • Operating System: Linux 4.15.0 1093 oem x86_64 with Ubuntu 18.04 bionic

Description

When I fetch variants by contig ID, I get the following UnicodeDecodeError, which points to a problem parsing the tabix-indexed file. The issue may come from pysam, but I would like to know whether you have had previous reports of it.

What I Did

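  • Index the VCF file with tabix
$ tabix -p vcf <vcf_filepath>
  • Fetch records by contig with vcfpy
import vcfpy

# DATA is a dict of file paths defined elsewhere in the session
reader = vcfpy.Reader.from_path(path=DATA['annot_vcf'], tabix_path=DATA['annot_vcf']+'.tbi')
for record in reader.fetch('chr1'):
    [...]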

Traceback Error

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-38-046818a3e579> in <module>
      3 variant_records = []
      4 sample_records = []
----> 5 for record in reader.fetch('chr1'):
      6     if record.CHROM in wanted_chroms:
      7         ALT = record.ALT[0].value

/usr/local/lib/python3.6/dist-packages/vcfpy/reader.py in __next__(self)
    171         """
    172         if self.tabix_iter:
--> 173             return self.parser.parse_line(str(next(self.tabix_iter)))
    174         else:
    175             result = self.parser.parse_next_record()

pysam/libctabix.pyx in pysam.libctabix.TabixIterator.__next__()

pysam/libcutils.pyx in pysam.libcutils.charptr_to_str()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2821: ordinal not in range(128)

dgomezpere, Sep 14 '20 16:09
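
For context on the error message above: 0xc3 is the lead byte of a two-byte UTF-8 sequence (accented Latin letters such as é or ñ start with it), so the offending line most likely holds UTF-8 text that is being decoded as ASCII. A minimal sketch that reproduces the same kind of exception with the standard library only; the VCF line is made up for illustration and is not taken from the reporter's file:

```python
# Hypothetical VCF line; the INFO value contains "é", encoded in UTF-8 as 0xc3 0xa9.
raw_line = "chr1\t12345\t.\tA\tG\t.\tPASS\tNOTE=café".encode("utf-8")

try:
    # Decoding as ASCII fails on the first byte above 0x7f.
    raw_line.decode("ascii")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start
    print(f"ascii decode failed at position {exc.start}: byte {raw_line[exc.start]:#x}")

# The same bytes decode cleanly as UTF-8.
decoded = raw_line.decode("utf-8")
print(decoded.split("\t")[-1])
```

The position reported in the exception is an offset into the data being decoded, which in the traceback appears to be a single tabix line.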

Interesting, what is your locale setting? C? What happens if you set export LC_ALL=en_US.UTF-8 or similar?

holtgrewe, Sep 14 '20 20:09

Hi @holtgrewe! My locale is already set to en_US.UTF-8:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

dgomezpere, Sep 14 '20 21:09
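
The shell's locale is not necessarily what the Python process sees, and the traceback comes from an IPython session that may have been started with a different environment. A small sketch, using only the standard library, for checking the encodings from inside that session; the helper name `describe_encodings` is made up for this example:

```python
import locale
import sys


def describe_encodings():
    """Return the encodings the current Python process actually uses."""
    return {
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "default": sys.getdefaultencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value}")
```

If these all report UTF-8 and the error persists, the decoding is probably governed by something other than the locale, for example an explicit encoding used inside the library.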

Any other ideas about the issue, @holtgrewe? Thanks in advance!

dgomezpere, Sep 16 '20 14:09

It looks like you have non-ASCII Unicode characters in your VCF file and pysam is stumbling over them...

holtgrewe, Sep 16 '20 15:09
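
One way to confirm this is to scan the compressed VCF for bytes above 0x7f. A rough sketch, assuming the file is bgzip-compressed (bgzip output is readable by Python's `gzip` module); the function name `find_non_ascii_lines` and its `limit` parameter are invented for this example:

```python
import gzip


def find_non_ascii_lines(vcf_gz_path, limit=10):
    """Return up to `limit` tuples (line_number, byte_offset, byte_value),
    one per line, for the first non-ASCII byte found on that line."""
    hits = []
    with gzip.open(vcf_gz_path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            for offset, byte in enumerate(raw):
                if byte > 0x7F:
                    hits.append((line_number, offset, byte))
                    break  # only report the first offending byte per line
            if len(hits) >= limit:
                break
    return hits
```

Calling `find_non_ascii_lines(DATA['annot_vcf'])` on the file from the report would list the lines to inspect; annotation fields with free text are a plausible place to look first.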

Hm, I don't remember why I chose pysam over pytabix, and I don't know whether pytabix is more robust... One could try replacing the tabix part of pysam with pytabix in vcfpy...

holtgrewe, Sep 16 '20 15:09
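
A possible workaround that stays with pysam: `pysam.TabixFile` takes an `encoding` argument that defaults to "ascii", which would be consistent with the error in the traceback. The sketch below opens the index directly with UTF-8 decoding. It has not been confirmed in this thread, the helper name `fetch_lines` is made up, and whether vcfpy 0.13.2 can pass such an option through `Reader.from_path` is not established here.

```python
def fetch_lines(vcf_path, region, encoding="utf-8"):
    """Return the raw VCF lines in `region`, decoded with `encoding`."""
    # Third-party dependency; imported here so it is only needed when called.
    import pysam

    tabix_file = pysam.TabixFile(vcf_path, encoding=encoding)
    try:
        # With no parser given, fetch() yields each record as a plain string.
        return list(tabix_file.fetch(region))
    finally:
        tabix_file.close()
```

The returned strings could then, in principle, be handed to `reader.parser.parse_line(line)`, the same call that `Reader.__next__` makes in the traceback, though that relies on a vcfpy internal rather than a documented interface.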