daff breaks horribly if file is not utf8

Open SonOfLilit opened this issue 7 years ago • 5 comments

On Windows, tried both with cmd and a git bash shell:

csv_windows-1255.zip

$ daff.py version
1.3.18
$ daff.py 1.csv 2.csv
Traceback (most recent call last):
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 11304, in <module>
    Coopy.main()
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3447, in main
    return coopy.coopyhx(io)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3333, in coopyhx
    return self.run(args,io)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3284, in run
    a = self.loadTable(aname)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 2640, in loadTable
    txt = self.io.getContent(name)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 9752, in getContent
    return sys_io_File.getContent(name)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 11018, in getContent
    content = f.read(-1)
  File "C:\users\sonoflilit\.virtualenvs\analysts\lib\codecs.py", line 668, in read
    return self.reader.read(size)
  File "C:\users\sonoflilit\.virtualenvs\analysts\lib\codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 4: invalid continuation byte
$ which daff
/c/Program Files/nodejs/daff
$ daff version
1.3.18
$ daff 1.csv 2.csv
@@,a,b

Of course, the reason I care is that Excel works notoriously badly with UTF-8 CSVs, so my git repository is full of CSVs in other encodings, and I can't convert them as part of git diff...

P.S. does anyone here know why git would accept my .gitattributes entry for *.tsv but would silently ignore the identical entry for *.csv?

SonOfLilit, Sep 15 '16 22:09

Thanks for reporting this @SonOfLilit. For daff.py, a hack to make this work is to edit it by hand, replacing codecs.open(path,"r","utf-8") with codecs.open(path,"r","iso-8859-1"). With that change, I see a diff of:

@@,a,b
→, à,á→â

You may need to change more if you want the diff itself to be produced in the same encoding rather than utf-8.
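
The effect of swapping the codec can be reproduced outside daff. This is only a sketch: the bytes below are a guess at what 1.csv might contain, chosen to match the traceback (byte 0xe0 at position 4) and the diff above, and are not taken from the attached zip.

```python
# Hypothetical reconstruction of 1.csv: header "a,b", then two non-ASCII cells.
raw = b"a,b\n\xe0,\xe1\n"

# Strict UTF-8 fails: 0xe0 starts a three-byte sequence, but "," follows it.
try:
    raw.decode("utf-8")
    error_position = None
except UnicodeDecodeError as err:
    error_position = err.start

# ISO-8859-1 assigns a character to every byte, so decoding cannot fail,
# but the cells come out as Latin letters (à, á).
as_latin1 = raw.decode("iso-8859-1")

# The attachment name suggests windows-1255, where the same bytes are
# the Hebrew letters alef and bet.
as_cp1255 = raw.decode("cp1255")

print(error_position)
print(as_latin1)
print(as_cp1255)
```

So the iso-8859-1 hack makes the crash go away, but what it prints is only correct if the file really is Latin-1.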

How should this ideally work? A parameter specifying encoding? An attempt at autodetection?

paulfitz, Sep 16 '16 21:09

A param would be best; you can't rely on what the file says, as you can have latin1 in a utf8 file 👎

I guess you could use auto-detection as a default, but will need something to be able to specify when things are crazy.

dogmatic69, Sep 16 '16 21:09

Ideally there should be a cmd parameter because some poor people need to use utf16, which can't be made sense of without very special treatment.

But more importantly, default behavior should be to work on raw, undecoded bytes. As long as you never try to split cell contents (e.g. you must output "[abc->aBc]" and not "a[b->B]c", which might split a character in the middle in utf8), every other encoding I'm aware of would work just fine, including utf8, DOS codepages, ISO codepages and Windows codepages (I must admit I have no idea how pre-Unicode Chinese/Japanese codepages work, but they would probably be fine too).
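
A toy version of the byte-level idea, to make it concrete. This is not how daff works; it assumes an ASCII-compatible encoding, no quoted cells, and two tables of the same shape, and the function names are made up for illustration.

```python
def parse(raw):
    """Split on ASCII line breaks and commas only; cells stay as bytes."""
    return [line.split(b",") for line in raw.splitlines()]


def diff_cells(old, new):
    """Return (row, column, old_cell, new_cell) for every cell that differs."""
    changes = []
    for r, (row_a, row_b) in enumerate(zip(parse(old), parse(new))):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                changes.append((r, c, a, b))
    return changes


def render(old_cell, new_cell):
    """Report a change as whole cells, never as a partial-cell edit."""
    return b"[" + old_cell + b"->" + new_cell + b"]"


# Hypothetical inputs in some single-byte codepage; nothing is ever decoded.
old = b"a,b\n\xe0,\xe1\n"
new = b"a,b\n\xe0,\xe2\n"

changes = diff_cells(old, new)
rendered = [render(a, b) for _, _, a, b in changes]
print(changes)
print(rendered)
```

Because cells are only compared and copied whole, the output bytes are valid in whatever encoding the input used, without the tool knowing which one that is.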

SonOfLilit, Sep 17 '16 18:09

Ok, sounds like a parameter is important since there'll always be those who need it.

I'm not sure I can completely avoid touching cell contents. There are options for whitespace-insensitive and case-insensitive diffs, for example. These obviously get wacky in the general case, but people want them for the common special case of plain old ascii. Do you think auto-detection via delegation to e.g. chardet [1] in python would be adequate, @SonOfLilit?

[1] https://github.com/chardet/chardet
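
A rough sketch of how an explicit parameter and chardet-based detection could be combined. The function name `decode_table` and its `encoding` argument are hypothetical, not part of daff; the only chardet call used is `chardet.detect`, and the sketch falls back to utf-8 when chardet is not installed or cannot guess.

```python
def decode_table(data, encoding=None):
    """Decode raw file bytes; an explicit encoding always beats detection."""
    if encoding is None:
        try:
            import chardet  # third-party, optional
        except ImportError:
            encoding = "utf-8"
        else:
            # detect() returns a dict; its "encoding" entry may be None.
            encoding = chardet.detect(data)["encoding"] or "utf-8"
    return data.decode(encoding)


# The caller knows the file is windows-1255, so no guessing takes place.
explicit = decode_table(b"\xe0,\xe1", encoding="cp1255")

# No encoding given: detection (or the utf-8 fallback) handles plain ASCII.
guessed = decode_table(b"a,b\n1,2\n")

print(explicit)
print(guessed)
```

Detection is a statistical guess and gets less reliable on short inputs, which is one reason to keep the explicit parameter as the override.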

paulfitz, Sep 19 '16 21:09

As long as you're only touching characters that are ASCII (commas, double quotes, tabs, spaces) you should be fine with all the encodings I listed as not needing a parameter - the reason they don't is that they only differ in the non-ASCII code points.
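
To illustrate with Python: its `bytes.strip()` and `bytes.lower()` only act on ASCII whitespace and ASCII letters, so a whitespace- and case-insensitive comparison can be done on undecoded cells. A minimal sketch, with `normalize` as a made-up helper name:

```python
def normalize(cell):
    """Strip ASCII whitespace and lowercase ASCII letters; other bytes pass through."""
    return cell.strip().lower()


# The same text in two encodings; the non-ASCII letter survives in both.
utf8_cell = "  Hello \u00e0 ".encode("utf-8")
latin1_cell = "  Hello \u00e0 ".encode("iso-8859-1")

utf8_result = normalize(utf8_cell).decode("utf-8")
latin1_result = normalize(latin1_cell).decode("iso-8859-1")

print(utf8_result)
print(latin1_result)
```

The case-insensitivity covers ASCII letters only, so a non-ASCII capital such as É is left as it is. This also relies on non-ASCII characters never containing ASCII-range bytes; Shift_JIS, for one, breaks that assumption, since its two-byte characters can end in a byte from the ASCII range.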

SonOfLilit, Sep 20 '16 00:09