
Improve clarity of the encoding error

Open aborruso opened this issue 2 years ago • 12 comments

Hi, I have this CSV file https://gist.github.com/aborruso/ec970c3a56596f9c014794466ce2f1d8

If I validate it via CLI I have

'charmap' codec can't decode byte 0x9d in position 3116: character maps to <undefined>

If I try to inspect it

head -c 3116 input.csv | tail -c -1

I get nothing special, I don't see a strange character.
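The raw bytes can also be inspected directly in Python. A minimal stdlib-only sketch (here `data` stands in for `open("input.csv", "rb").read()`, and the sample bytes are made up for illustration; note the position in the error message is an index into the buffer being decoded, which may not line up exactly with a byte offset reached via head/tail):

```python
# Find every occurrence of the problematic byte 0x9d and print some context.
# `data` stands in for open("input.csv", "rb").read(); sample bytes are made up.
data = "città,".encode("utf-8") + b"\x9d" + b"more,fields\n"
positions = [i for i, b in enumerate(data) if b == 0x9D]
for pos in positions:
    print(pos, data[max(0, pos - 20):pos + 20])
```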

How can I use this validation error message to clean this CSV file?

Thank you

aborruso avatar Jan 12 '23 08:01 aborruso

Hi @aborruso, thank you for reporting!

If I understand your question correctly, it should solve the issue:

frictionless validate finanziamenti.csv --encoding utf-8

shashigharti avatar Jan 12 '23 09:01 shashigharti

If I understand your question correctly, it should solve the issue:

Ok, thank you, I know, but why doesn't it map it automatically as utf-8? Shouldn't it do it automatically?

Thank you again

aborruso avatar Jan 12 '23 09:01 aborruso

Thanks! we will check.

shashigharti avatar Jan 12 '23 09:01 shashigharti

Ok, thank you, I know, but why doesn't it map it automatically as utf-8? Shouldn't it do it automatically?

Moreover chardetect gives utf-8 with confidence 0.99

aborruso avatar Jan 12 '23 10:01 aborruso

Thank you!

It seems to be a bug. I also checked, and it is inferred as 'utf-8' with 99% confidence on the command line, but the same library in Frictionless gives a different result: Windows-1252 (0.73 confidence).

shashigharti avatar Jan 13 '23 10:01 shashigharti

Hi, there are two aspects:

  • the underlying detection library (the Python version of chardet) detects it as cp1252 unless we use a bigger buffer size, e.g. frictionless describe tmp/finanziamenti.csv --buffer-size 1000000 (unfortunately, we can't fix the root cause at the Frictionless level)
  • in any case, the error message is not clear enough, so I've updated the issue to improve the error message
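For context, the error text itself comes from Python's cp1252 codec: the byte 0x9d occurs inside multi-byte UTF-8 sequences but has no mapping in Windows-1252. A minimal sketch (the sample character is arbitrary, chosen only because its UTF-8 encoding contains 0x9d):

```python
# 'ĝ' encodes to b'\xc4\x9d' in UTF-8; the second byte, 0x9d, is unmapped
# in Windows-1252, which is exactly what triggers the "charmap" error.
raw = "ĝ".encode("utf-8")
print(raw.decode("utf-8"))        # decodes fine as UTF-8
try:
    raw.decode("cp1252")
except UnicodeDecodeError as err:
    print(err)                    # 'charmap' codec can't decode byte 0x9d ...
```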

roll avatar Jan 16 '23 18:01 roll

  • the underlying detection library (Python version of chardet) detects it as cp1252

We should understand what mechanism chardetect uses through the CLI, because that's the right one: via the CLI I get utf-8 with confidence 0.99.

Thank you @roll

aborruso avatar Jan 16 '23 18:01 aborruso

@aborruso I was wrong because I did not consider how many rows the framework uses to predict the encoding.

Just to add to what @roll has said: if I use only the first 500 rows on the command line: head -500 finanziamenti.csv | chardetect

it predicts (the same as the framework does): <stdin>: Windows-1252 with confidence 0.73

So if you run the validation with an increased buffer size (e.g. 5000 or 1000000), it infers the correct encoding and validation passes (as Evgeny (@roll) said above): frictionless validate finanziamenti.csv --buffer-size 1000000

shashigharti avatar Jan 17 '23 06:01 shashigharti

What chardet do you use in CLI? (note that under the same name might be different implementations)

roll avatar Jan 17 '23 09:01 roll

The standard https://github.com/chardet/chardet

And I run simply "chardetect input.csv"

aborruso avatar Jan 17 '23 09:01 aborruso

@roll also via code, not just the CLI, I get utf-8.

The sample code

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('finanziamenti.csv'):
    print(filename.ljust(60), end='')
    detector.reset()
    for line in open(filename, 'rb'):
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    print(detector.result)
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

aborruso avatar Jan 17 '23 11:01 aborruso

I think the difference is that Frictionless feeds the whole buffer (10000 bytes by default) to the chardet detector, and at some point in this file there is a weird character that confuses chardet. If we reduce the buffer size, it also detects utf-8:

  • frictionless describe tmp/finanziamenti.csv --buffer-size 100
  • frictionless describe tmp/finanziamenti.csv --buffer-size 1000
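One plausible mechanism (an illustration, not a confirmed diagnosis of this file): a buffer cut in the middle of a multi-byte UTF-8 sequence no longer decodes cleanly as UTF-8, which can tip a detector toward a single-byte encoding such as cp1252. A stdlib-only sketch:

```python
# Truncating a buffer mid-way through a multi-byte UTF-8 sequence leaves
# bytes that are invalid as UTF-8, while cp1252 still decodes them happily.
data = "finanziamenti città".encode("utf-8")  # 'à' is the two bytes b'\xc3\xa0'
truncated = data[:-1]                         # cut off mid-sequence
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid as UTF-8:", err)
print(truncated.decode("cp1252"))             # decodes (wrongly) without error
```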

roll avatar Jan 17 '23 11:01 roll