WARNING: Error: JSON data must be UTF-8 encoded

Open He1my opened this issue 4 years ago • 5 comments

rmlint is unable to read a 50M JSON file that it produced itself after scanning a large number of files, even though file reports the content as UTF-8:

$ file large.json
large.json: UTF-8 Unicode text

See the attached file, large.tar.gz.

Edit: further tests with a 10M JSON file showed that the file was ISO-8859-1 encoded. After converting it with the following command, it worked:

iconv -f ISO-8859-1 -t UTF-8//TRANSLIT rmlint.json -o rmlintutf8.json

It seems that, for some reason, the JSON file that rmlint writes is ISO-8859-1.

He1my avatar Jan 06 '21 03:01 He1my
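
A tool such as file may classify a file from only part of its content, so one way to cross-check the report above is to attempt a strict UTF-8 decode of the whole file and note where it first fails. The following is a minimal sketch, assuming the file fits in memory (plausible for 10M or 50M); the function name first_invalid_utf8_offset is hypothetical and not part of rmlint.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding: raises on the first bad byte
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def check_file(path: str) -> Optional[int]:
    """Read a whole file (hypothetical helper) and report the first bad offset."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())
```

Calling check_file("rmlint.json") returns None for a file that is valid UTF-8 throughout; otherwise the returned offset points at the first byte a strict UTF-8 parser would reject.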

I think it has something to do with https://github.com/sahib/rmlint/blob/master/lib/formats/json.c#L73

@sahib any thoughts on best way forwards? Maybe write using json-glib instead?

SeeSpotRun avatar Mar 14 '21 02:03 SeeSpotRun

Hey,

> @sahib any thoughts on best way forwards? Maybe write using json-glib instead?

Yes, I'm afraid this is the cause. That hacky util was there before we used json-glib for replay and it didn't age well. Alternatively, one could make this function UTF-8 safe by using g_utf8_next_char and g_utf8_get_char. From what I can see that's the only place that operates on individual bytes. But honestly, using json-glib is likely the more sensible option here.

sahib avatar Mar 14 '21 10:03 sahib
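
The UTF-8-safe alternative mentioned above, stepping through the string one code point at a time rather than one byte at a time, can be illustrated outside of rmlint. The sketch below is Python, not the C code in json.c, and it does not reproduce what that function does. It only shows the general idea: iterating over a Python str yields whole code points, which is roughly what g_utf8_get_char and g_utf8_next_char provide in GLib. The name escape_json_string is hypothetical.

```python
import json


def escape_json_string(text: str) -> str:
    """Escape text for use inside a JSON string literal, one code point at a time."""
    pieces = []
    for char in text:  # each item is a full code point, never a partial byte sequence
        code_point = ord(char)
        if char == '"':
            pieces.append('\\"')
        elif char == "\\":
            pieces.append("\\\\")
        elif code_point < 0x20:
            # Control characters must be escaped in JSON strings.
            pieces.append(f"\\u{code_point:04x}")
        else:
            # Everything else, including non-ASCII such as "é", passes through intact.
            pieces.append(char)
    return "".join(pieces)


def round_trips(text: str) -> bool:
    """Check that a standard JSON parser recovers the original text."""
    literal = '"' + escape_json_string(text) + '"'
    return json.loads(literal) == text
```

Because only ASCII characters are ever rewritten, a multi-byte character is never split, and encoding the result as UTF-8 gives valid JSON. In practice a JSON library does this work already, which matches the suggestion to use json-glib.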

@He1my are you able to confirm if https://github.com/SeeSpotRun/rmlint/tree/glib-json fixes this?

SeeSpotRun avatar Mar 18 '21 22:03 SeeSpotRun

Closing for now, can re-open if issue recurs

SeeSpotRun avatar Mar 22 '21 06:03 SeeSpotRun

Hi, I have the same error, but for a different reason: my filenames are mostly UTF-8, but some are Latin-1 (old backups...).

LC_ALL=C rmlint .

# Duplicate(s):
    ls '/tmp/test/numéros-téléphone.2002.txt'
    rm '/tmp/test/num�ros-t�l�phone.txt'

When trying to replay:

LC_ALL=C rmlint --replay rmlint.json .
INFO: Loading json-results `/tmp/test/rmlint.json'
WARNING: Error: JSON data must be UTF-8 encoded
WARNING: Loading /tmp/test/rmlint.json failed.
ERROR: No valid .json files given, aborting.

I have a 1.6 GB rmlint.json that the current version (rev 254cf962) cannot read. Is there a solution?

samy-p avatar Mar 08 '24 18:03 samy-p
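
For a file that mixes UTF-8 and Latin-1 filenames, converting the whole file with iconv from ISO-8859-1 would double-encode the parts that are already UTF-8. One possible workaround, sketched here under assumptions and untested against rmlint, is to keep valid UTF-8 sequences as they are and reinterpret only the invalid bytes as Latin-1. The error handler name latin1_fallback and the functions repair_line and repair_file are hypothetical. The file is processed line by line so that a 1.6 GB input does not need to fit in memory; splitting on newlines is safe because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence.

```python
import codecs


def _latin1_fallback(exc):
    """Decode error handler: treat each undecodable byte as Latin-1."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which decoding resumes.
    return bad_bytes.decode("latin-1"), exc.end


codecs.register_error("latin1_fallback", _latin1_fallback)


def repair_line(raw: bytes) -> bytes:
    """Return raw re-encoded as valid UTF-8, leaving valid sequences unchanged."""
    return raw.decode("utf-8", errors="latin1_fallback").encode("utf-8")


def repair_file(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path, repairing one line at a time."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for raw in src:
            dst.write(repair_line(raw))
```

Two caveats apply. A Latin-1 byte sequence that happens to form valid UTF-8 is left as it is, so the result is a best guess. More importantly, the repaired paths no longer match the Latin-1 names on disk byte for byte, so even if the JSON loads, rmlint may not find those particular files when replaying.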