django-todo
Occasional crasher when importing CSVs
Slightly awkward - I am the project's main author and filing this bug because I need help. I see occasional crash reports when people import CSVs into the demo site. The tracebacks don't tell me anything useful beyond Exception Value: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte. I don't have access to the uploaded files because they're InMemory files. I've tried everything I can think of to reproduce the problem but just can't make it crash.
If you uploaded a CSV and got it to crash, can you provide details in this thread? Thanks.
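For context, the exception itself is easy to reproduce in isolation: it fires whenever a UTF-8 lead byte is followed by a byte that isn't a valid continuation byte. A minimal demonstration (the sample text here is illustrative, not from an actual crash report):

```python
# "café latte" encoded as Latin-1: 0xE9 (é) is a UTF-8 lead byte,
# but the space (0x20) that follows is not a valid continuation byte.
raw = "café latte".encode("latin-1")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # invalid continuation byte
```

So any CSV saved in a legacy 8-bit encoding that contains accented characters could trigger this on import.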
Hi Shacker,
I found that a similar error was reported on SO a few years back; if this doesn't help, I could check with my data scientist friends. https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte
One of the things I like about shacker/django-todo is that it doesn't crash.
OK, some further findings. Without being able to check, I'd guess one of two things is happening: they're passing a filepath instead of a file object, or the file isn't UTF-8 encoded.
Should we use a chardet-type function to score the file as one of the approved encodings before importing, and insert a "this file might need to be converted to utf-8" warning?
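A stdlib-only sketch of that idea (no chardet dependency; `sniff_encoding`, the approved list, and the warning message are all illustrative, not part of django-todo): check for UTF-16/UTF-8 BOMs first, then attempt a plain UTF-8 decode, and warn if nothing matches.

```python
import codecs

def sniff_encoding(file_bytes: bytes) -> str:
    """Return an approved encoding name for file_bytes, or raise ValueError."""
    # UTF-16 files reliably start with a byte order mark.
    if file_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # UTF-8 with a BOM needs the utf-8-sig codec so the BOM is stripped.
    if file_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        file_bytes.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        raise ValueError("This file might need to be converted to UTF-8.")
```

chardet would give a statistical guess for legacy encodings too, but for a yes/no gate on approved encodings a direct decode attempt is simpler and has no false positives.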
@datatalking Good theory - could be the file encoding. Maybe I (or one of us) just needs to intentionally save a CSV file with some other encoding and see what happens. We shouldn't need to warn, though - if that turns out to be the problem, we could wrap the file opener so that it opens "as" UTF-8. Do you know of a good way to save a CSV as non-UTF-8 that you could test with? (Or provide one and I can test it?)
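One easy way to produce such a test file (a sketch; the filename and row contents are made up): write it with the cp1252 encoding that Excel commonly uses on Windows. Any accented character then becomes a byte sequence that is invalid UTF-8.

```python
import csv

# Write a CSV in cp1252, the default "Windows/Excel" 8-bit encoding.
# "Café" and "Renée" contain 0xE9, which is not valid UTF-8.
with open("todo_cp1252.csv", "w", encoding="cp1252", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Summary", "Assigned To"])
    writer.writerow(["Café resupply", "Renée"])

# Reading it back as UTF-8 should reproduce the reported crash:
#   open("todo_cp1252.csv", encoding="utf-8").read()
#   -> UnicodeDecodeError: ... invalid continuation byte
```

Uploading a file like this to the demo site's CSV importer would confirm (or rule out) the encoding theory.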
@shacker, I run into file encoding issues with CSV files a lot. I have written a couple of small routines I use routinely to fix this. I will post them here later for you.
Here's the function I wrote:
```python
import magic


def file_encoding(filepath):
    '''
    Text encoding is a bit of a schmozzle in Python and CSV data files. Alas.
    A quick summary:

    1. CSV files are written with a UTF-8 or UTF-16 encoding from time to time
    2. Python wants to know the encoding when we open the file
    3. UTF-16 is fine, but UTF-8 comes in two flavours: with and without a BOM
    4. The BOM (byte order mark) is an optional field specifying byte order,
       and is irrelevant to UTF-8
       https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
    5. In fact, the Unicode standard recommends against including a BOM with UTF-8
    6. Python assumes it's not there
    7. Most apps write it (at least sometimes)
    8. The encoding must therefore be specified as:
           utf-16     for UTF-16 files
           utf-8      for UTF-8 files with no BOM
           utf-8-sig  for UTF-8 files with a BOM
    9. The "magic" library reliably and efficiently determines the encoding by
       looking at the magic numbers at the start of a file
    10. Alas, it returns a rich string describing the encoding: it contains
        either "UTF-16" or "UTF-8", and "(with BOM)" if a BOM is detected
    11. Because of this schmozzle, here is a quick function to translate
        "magic" output to standard encoding names

    :param filepath: The path to a file
    '''
    # We support UTF-8 and UTF-16 encodings; these are the first words that
    # the magic library returns when reporting the file type.
    m = magic.from_file(filepath)
    utf16 = m.find("UTF-16") >= 0
    utf8 = m.find("UTF-8") >= 0
    bom = m.find("(with BOM)") >= 0
    if utf16:
        return "utf-16"
    elif utf8:
        if bom:
            return "utf-8-sig"
        else:
            return "utf-8"
    # Any other encoding falls through and returns None.
```
and how I use it:
```python
import csv

with open(data_file, "r", encoding=file_encoding(data_file), newline='') as file:
    reader = csv.DictReader(file)
```
This has basically solved all the encoding issues I've run into with diverse CSV files.
@bernd-wechner Awesome, thanks a bunch! Do you by chance have a CSV that can crash django-todo on import? If so, can you share the file?
Alas no, not I. No need for CSV import yet. I do have some PRs open for you though fixing stuff that I did need ;-)