UnicodeDecodeError in sanitize_metadata
Some recent additions to GISAID are causing sanitize_metadata.py to fail on my system. It isn't obvious whether this is an issue in the GISAID data, my local system/environment settings, or my data prep script.
The error can be worked around by pre-processing the metadata with:
sed -i.bak 's/[\d128-\d255]//g' metadata.tsv
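If sed is unavailable or behaves differently across platforms, the same byte-stripping can be sketched in Python (the function name is mine, not part of the ncov scripts). Note that, like the sed command, this destructively removes all non-ASCII bytes, including legitimate accented characters:

```python
def strip_non_ascii(data: bytes) -> bytes:
    """Drop every byte with value >= 128 (outside ASCII),
    mirroring the sed byte range above."""
    return bytes(b for b in data if b < 0x80)
```

Applied to the metadata file after saving a backup, e.g. `Path("metadata.tsv").write_bytes(strip_non_ascii(Path("metadata.tsv").read_bytes()))`.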
The error text is:
Traceback (most recent call last):
File "/home/ncov/scripts/sanitize_metadata.py", line 405, in <module>
database_ids_by_strain = get_database_ids_by_strain(
File "/home/ncov/scripts/sanitize_metadata.py", line 211, in get_database_ids_by_strain
for metadata in metadata_reader:
File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1024, in __next__
return self.get_chunk()
File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1074, in get_chunk
return self.read(nrows=size)
File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1047, in read
index, columns, col_dict = self._engine.read(nrows)
File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/python_parser.py", line 246, in read
content = self._get_lines(rows)
File "/home/my_conda_envs/nextstrain/lib/python3.9/site-packages/pandas/io/parsers/python_parser.py", line 1049, in _get_lines
new_rows.append(next(self.data))
File "/home/my_conda_envs/nextstrain/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 7734: invalid continuation byte
The bad character seems to be present in one or more sequences in the following list, probably in the submitting or originating lab fields:
GISAID ID
--
EPI_ISL_8054068
EPI_ISL_8351839
EPI_ISL_8515106
EPI_ISL_8722546
EPI_ISL_8633134
EPI_ISL_8711110
EPI_ISL_8480143
EPI_ISL_8607291
EPI_ISL_8844787
EPI_ISL_8508549
EPI_ISL_8893774
EPI_ISL_8931724
EPI_ISL_8837874
EPI_ISL_8826040
EPI_ISL_8722609
EPI_ISL_8932327
EPI_ISL_8789921
EPI_ISL_8664443
EPI_ISL_8663986
EPI_ISL_8818636
EPI_ISL_8790891
EPI_ISL_8790205
EPI_ISL_8788890
EPI_ISL_8607109
EPI_ISL_9015602
EPI_ISL_8785652
EPI_ISL_8681256
EPI_ISL_8683191
EPI_ISL_8055058
EPI_ISL_8766418
EPI_ISL_8242009
EPI_ISL_8585276
EPI_ISL_8927637
EPI_ISL_9010538
EPI_ISL_8465460
EPI_ISL_8579421
EPI_ISL_8976041
EPI_ISL_8976040
EPI_ISL_8975532
EPI_ISL_8985674
EPI_ISL_8985653
EPI_ISL_8985734
EPI_ISL_8925410
EPI_ISL_8799966
EPI_ISL_8931073
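To narrow down which records carry the invalid bytes, a small scan of the raw file can report the undecodable lines (this is my own helper, not part of the ncov scripts):

```python
def find_undecodable_lines(path):
    """Yield (line_number, raw_bytes) for each line in the file
    that fails to decode as UTF-8."""
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                yield lineno, raw
```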
This seems to me likely to be bad upstream source data from GISAID (i.e., actually invalid UTF-8).
However, it could plausibly be an issue in pandas' row/line-chunked parsing accidentally splitting a single multi-byte UTF-8 character across reads/decodes.
I'm not sure without digging in more, and I would really need to be able to reproduce this locally to diagnose it.
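If the bytes are genuinely invalid UTF-8, a non-destructive alternative to stripping them is to decode with errors="replace", so offending bytes become U+FFFD instead of aborting the parse. A sketch of that approach (this is not how sanitize_metadata.py currently opens the file, and the function name is my own):

```python
import pandas as pd

def read_metadata_lossy(path, chunksize=10_000):
    """Read a TSV in chunks, replacing invalid UTF-8 bytes with U+FFFD
    instead of raising UnicodeDecodeError."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        yield from pd.read_csv(fh, sep="\t", chunksize=chunksize)
```

Recent pandas (1.3+) can also do this directly via `pd.read_csv(path, sep="\t", encoding_errors="replace")`. Either way, the replacement characters then show up in the offending fields, which also helps locate the bad records.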
@jacaravas Just to clarify, is the data flow for the metadata you pass to sanitize metadata like this?
- GISAID API endpoint
- internal database
- metadata TSV
@huddlej Yes, that is correct. There is a Python script between steps 2 and 3 where extra annotations are applied, names are normalized, etc. I will try to get back to this tomorrow to confirm these entries are still causing failures for me.
I struggled with this bug and could not get past the sanitize_metadata.py step. My metadata was retrieved from GISAID using the augur input option. The solution provided here (sed -i.bak 's/[\d128-\d255]//g' metadata.tsv) kept giving me an "invalid collation character" error. What worked for me was converting the metadata file to UTF-8 encoding using Notepad++ and then using the re-encoded version as my metadata.tsv.
@Gathii I suspect that "invalid collation character" error from that sed command implicates something about your locale settings. I'd be curious to know what the output of the locale command is on your system. In any case, glad you found a workaround and shared it here!
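For the record, that collation error can often be sidestepped by forcing the C locale for the single sed invocation, so the bracket expression is treated as plain bytes rather than UTF-8 collation elements. A sketch (this assumes GNU sed, whose \dNNN escapes the original command relies on; the demo input file is my own):

```shell
# Demo input containing a high byte (0xC4, written as octal \304):
printf 'lab\304name\n' > metadata.tsv
# Run sed under the C locale so the bracket expression is interpreted
# as a plain byte range rather than a UTF-8 collation range:
LC_ALL=C sed -i.bak 's/[\d128-\d255]//g' metadata.tsv
```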
@tsibley see below my locale settings:
LANG=en_US.utf-8
LC_CTYPE="en_US.utf-8"
LC_NUMERIC="en_US.utf-8"
LC_TIME="en_US.utf-8"
LC_COLLATE="en_US.utf-8"
LC_MONETARY="en_US.utf-8"
LC_MESSAGES="en_US.utf-8"
LC_PAPER="en_US.utf-8"
LC_NAME="en_US.utf-8"
LC_ADDRESS="en_US.utf-8"
LC_TELEPHONE="en_US.utf-8"
LC_MEASUREMENT="en_US.utf-8"
LC_IDENTIFICATION="en_US.utf-8"
LC_ALL=en_US.utf-8
Thanks