Occasional crasher when importing CSVs

Open shacker opened this issue 4 years ago • 7 comments

Slightly awkward - I am the project's main author and filing this bug because I need help. I see occasional crash reports when people import CSVs into the demo site. The tracebacks don't tell me anything useful beyond Exception Value: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte. I don't have access to the uploaded files because they're InMemory files. I've tried everything I can think of to reproduce the problem but just can't make it crash.

If you uploaded a CSV and got it to crash, can you provide details in this thread? Thanks.

shacker avatar Apr 01 '21 05:04 shacker

Hi Shacker,

I found a similar error reported on SO a few years back. If this doesn't help, I could check with my data scientist friends. https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte

datatalking avatar Apr 01 '21 19:04 datatalking

One of the things I like about shacker/django-todo is that it doesn't crash.

OK, so on further thought, and without being able to check, I'd guess one of two things is happening: they're passing a filepath instead of a file object, or the file isn't UTF-8 encoded.

Should we use a chardet-style function to check that the file's encoding is one of the approved types before importing, and show a "this file might need to be converted to UTF-8" warning if it isn't?
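
A minimal sketch of that pre-import check, using only a strict UTF-8 decode from the standard library rather than chardet. The function name `utf8_warning` and the wording of the message are hypothetical; nothing here is taken from django-todo itself.

```python
from typing import Optional


def utf8_warning(raw: bytes) -> Optional[str]:
    """Return a warning string if `raw` is not valid UTF-8, else None.

    Hypothetical helper: it only validates UTF-8 and does not try to
    guess what the real encoding is.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that failed to decode.
        return (
            "This file might need to be converted to UTF-8 "
            f"(undecodable byte at offset {exc.start})."
        )
    return None
```

Calling it on the uploaded bytes before handing them to the CSV reader would let the view show a message instead of crashing.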

datatalking avatar Apr 01 '21 20:04 datatalking

@datatalking Good theory - could be the file encoding. Maybe I (or one of us) just needs to intentionally save a CSV file with some other encoding and see what happens. We shouldn't need to warn though - if that turns out to be the problem, we could wrap the file opener so that it opens "as" UTF-8. Do you know of a good way to save a CSV as non-UTF-8 that you could test with? (or provide and I can test it?)
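
One possible way to get such a test file, sketched under assumptions: build the CSV bytes in memory with a Latin-1 encoding, confirm that a strict UTF-8 decode fails, and wrap the decoding step with a fallback. The column names, `make_non_utf8_csv`, and `read_rows` are all hypothetical and not part of django-todo.

```python
import csv
import io


def make_non_utf8_csv() -> bytes:
    """Build a small CSV encoded as Latin-1 (hypothetical column names)."""
    text = "Title,Assigned To\r\nBuy crème brûlée,José\r\n"
    return text.encode("latin-1")


def read_rows(raw: bytes, fallback: str = "latin-1"):
    """Decode as UTF-8 (tolerating a BOM), falling back to `fallback`.

    Latin-1 can decode any byte sequence, so the fallback never raises,
    but it may map bytes to the wrong characters if the file was really
    written in some other encoding.
    """
    try:
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        text = raw.decode(fallback)
    return list(csv.DictReader(io.StringIO(text, newline="")))


raw = make_non_utf8_csv()
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # A strict UTF-8 decode of the Latin-1 bytes fails here.
    print("reproduced:", exc.reason)

print(read_rows(raw))
```

Writing `raw` to disk with `open(path, "wb")` would give a file to upload through the import form.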

shacker avatar Apr 03 '21 06:04 shacker

@shacker, I run into file encoding issues with CSV files a lot. I have written a couple of small routines I use routinely to fix this. I will post them here later for you.

bernd-wechner avatar Sep 23 '21 11:09 bernd-wechner

Here's the function I wrote:

import magic
def file_encoding(filepath):
    '''
    Text encoding is a bit of a schmozzle in Python and csv data files. Alas.
    
    A quick summary:
    
    1. CSV files can be written with a UTF-8 or UTF-16 encoding from time to time
    2. Python wants to know the encoding when we open the file
    3. UTF-16 is fine, but UTF-8 comes in two flavours, with and without a BOM
    4. The BOM (byte order mark) is an optional field specifying byte order, which is irrelevant to UTF-8
    5. In fact Unicode standards recommend against including a BOM with UTF-8
        https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
    6. Python assumes it's not there
    7. Most apps write it (at least sometimes)
    8. The encoding must therefore be specified as:
        utf-16    for UTF-16 files
        utf-8     for UTF-8 files with no BOM
        utf-8-sig for UTF-8 files with a BOM
    9. The "magic" library reliably determines the encoding efficiently by looking
       at the magic numbers at the start of a file
    10. Alas it returns a rich string describing the encoding.
    11. It contains either UTF-16 or UTF-8
    12. It contains "(with BOM)" if a BOM is detected
    13. Because of this schmozzle a quick function to translate "magic" output
        to standard encoding names is here.
    
    :param filepath: The path to a file
    '''
    # Support UTF-8 or UTF-16 encodings and this is the first
    # word that the magic library returns when reporting the file type.
    m = magic.from_file(filepath)
    utf16 = m.find("UTF-16")>=0
    utf8 = m.find("UTF-8")>=0
    bom = m.find("(with BOM)")>=0
    
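    if utf16:
        return "utf-16"
    elif utf8:
        if bom:
            return "utf-8-sig"
        else:
            return "utf-8"
    # Anything else (e.g. plain ASCII or ISO-8859 text) falls through and
    # returns None, which makes open() use the platform default encoding.
    return None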

and how I use it:

import csv
with open(data_file, "r", encoding=file_encoding(data_file), newline='') as file:
    reader = csv.DictReader(file)

Basically solved all my encoding issues with diverse CSV files I've encountered.
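
The function above takes a file path, while the crashing uploads in this thread are in-memory files. A rough standard-library sketch of the same idea for raw bytes, checking only for a BOM, might look like the following. `sniff_bom_encoding` is a hypothetical name, and unlike `magic` it cannot recognise BOM-less UTF-16 or legacy encodings.

```python
import codecs


def sniff_bom_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Pick a codec name from the BOM at the start of `raw`, if any."""
    # Check UTF-32 first: its little-endian BOM begins with the same two
    # bytes as the UTF-16 little-endian BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM found: assume the default, which may still be wrong.
    return default
```

The result can be passed straight to `bytes.decode()`, since the `utf-8-sig`, `utf-16`, and `utf-32` codecs all consume the BOM themselves.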

bernd-wechner avatar Sep 23 '21 11:09 bernd-wechner

@bernd-wechner Awesome, thanks a bunch! Do you by chance have a CSV that can crash django-todo on import? If so, can you share the file?

shacker avatar Sep 28 '21 05:09 shacker

Alas no, not I. No need for CSV import yet. I do have some PRs open for you though fixing stuff that I did need ;-)

bernd-wechner avatar Sep 28 '21 06:09 bernd-wechner