mysqldump-to-csv
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
After writing a table to CSV on my Windows 11 machine, I tried to open the generated file in Python and found that it starts with the byte 0xff:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
So I had to open it as UTF-16:
with open('tables-imported.csv', 'r', encoding="utf-16") as f:
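A byte 0xff at position 0 is consistent with a UTF-16 little-endian byte order mark (FF FE). One possible cause, and this is only a guess, is that the CSV was written via `>` redirection in Windows PowerShell, which defaults to UTF-16. Rather than hard-coding the encoding, here is a minimal sketch that picks it from the BOM. The helper names `sniff_encoding` and `read_rows` are made up for illustration and are not part of mysqldump-to-csv.

```python
import codecs
import csv


def sniff_encoding(path, default="utf-8"):
    """Guess a text encoding from the file's byte order mark, if it has one."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to pick the endianness
    # No recognised BOM (UTF-32 is deliberately not handled in this sketch).
    return default


def read_rows(path):
    """Read a CSV file whose encoding is only known from its BOM."""
    with open(path, "r", encoding=sniff_encoding(path), newline="") as f:
        return list(csv.reader(f))
```

With this, `read_rows('tables-imported.csv')` should work whether the file was written as UTF-8 or UTF-16, as long as a UTF-16 file carries a BOM.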
Slightly more precise repro:
python mysqldump-to-csv/mysqldump_to_csv.py <enwiki-latest-categorylinks.sql
blows up with:
Traceback (most recent call last):
  File "/home/ciro/down/wiki/mysqldump-to-csv/mysqldump_to_csv.py", line 114, in <module>
    main()
  File "/home/ciro/down/wiki/mysqldump-to-csv/mysqldump_to_csv.py", line 104, in main
    for line in fileinput.input():
  File "/usr/lib/python3.11/fileinput.py", line 251, in __next__
    line = self._readline()
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/fileinput.py", line 372, in _readline
    return self._readline()
           ^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1980: invalid continuation byte
The likely reason is that the file contains binary data in the third column; it's a dumpster fire:
INSERT INTO `categorylinks` VALUES (10,'Redirects_from_moves','*..2NN:,@2.FBHRP:D6^A^W^Aܽ<DC>^L','2014-10-26 04:50:23','','uca-default-u-kn','page'),
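To check that bytes like these produce the same kind of failure, here is a small sketch. It assumes the terminal rendering `ܽ<DC>^L` at the end of the third column corresponds to the raw bytes DC BD DC 0C; that is my reading of the output, not something verified against the dump.

```python
# Assumed raw bytes behind the rendering `ܽ<DC>^L` (an assumption):
# DC BD is valid UTF-8 for U+073D, then a lone DC followed by 0x0C.
sample = b"\xdc\xbd\xdc\x0c"

try:
    sample.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # 0xDC announces a two-byte sequence, but 0x0C is not a continuation byte.
    error = exc

print(error)
```

If the assumption holds, the decoder rejects the second 0xdc for the same reason as in the traceback: it starts a two-byte sequence that the next byte does not continue.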
enwiki-latest-page.sql still works.
Not entirely sure why, but the solution at https://github.com/jamesmishra/mysqldump-to-csv/issues/17 worked for me. It likely just treats the input more byte-wise; it could be buggy when printing, but at least it does not blow up.
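For reference, here is one way a more byte-wise read can be done in Python. This is a sketch of the general idea only and not necessarily what the fix in issue #17 does; `iter_text_lines` is a hypothetical helper. It reads raw bytes and decodes each line with the `surrogateescape` error handler, so undecodable bytes are carried along as lone surrogates instead of raising.

```python
import io
import sys


def iter_text_lines(binary_stream, encoding="utf-8", errors="surrogateescape"):
    """Yield decoded lines from a binary stream without raising on bad bytes."""
    for raw in binary_stream:
        # Invalid bytes become lone surrogates (U+DC80..U+DCFF) and can be
        # turned back into the original bytes when encoding again.
        yield raw.decode(encoding, errors=errors)


def copy_lines(binary_in, binary_out):
    """Round-trip lines from one binary stream to another, byte for byte."""
    for line in iter_text_lines(binary_in):
        binary_out.write(line.encode("utf-8", errors="surrogateescape"))


# Intended use with a dump piped in on stdin (not executed here):
#     copy_lines(sys.stdin.buffer, sys.stdout.buffer)
```

The caveat about printing applies here too: passing such a line to `print()` on a UTF-8 stdout can raise UnicodeEncodeError because of the lone surrogates, which is why the sketch writes encoded bytes to a binary stream. Using `errors="replace"` instead avoids that at the cost of losing the original bytes.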