rmlint
WARNING: Error: JSON data must be UTF-8 encoded
rmlint is unable to read a 50 MB JSON file that it produced itself after scanning a large number of files. The file command reports the file as UTF-8:
$ file large.json
large.json: UTF-8 Unicode text
See the attached file, large.tar.gz.
Edit: further tests with a 10 MB JSON file showed that the file was ISO-8859-1 encoded. I converted it with
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT rmlint.json -o rmlintutf8.json
and the converted file loaded fine.
So it seems that, for some reason, the JSON file written by rmlint ends up as ISO-8859-1.
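For anyone trying to confirm where such a file stops being valid UTF-8, here is a minimal Python sketch. It is not part of rmlint; the function name `first_invalid_utf8_offset` is made up for illustration. It reads the file in chunks, so it should also cope with very large result files. A likely reason `file` reported UTF-8 above is that it only inspects the beginning of the file, so a stray byte deep inside a 50 MB file can go unnoticed.

```python
import codecs
import io


def first_invalid_utf8_offset(stream, chunk_size=1 << 20):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    `stream` is any binary file object. Hypothetical helper, not rmlint API.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    consumed = 0  # bytes handed to the decoder before the current chunk
    while True:
        chunk = stream.read(chunk_size)
        # Bytes of an incomplete sequence carried over from the previous chunk.
        pending = len(decoder.getstate()[0])
        try:
            decoder.decode(chunk, final=not chunk)
        except UnicodeDecodeError as err:
            # err.start is relative to (pending bytes + current chunk).
            return consumed - pending + err.start
        if not chunk:
            return None
        consumed += len(chunk)


if __name__ == "__main__":
    sample = io.BytesIO(b'{"path": "num\xe9ros.txt"}')  # Latin-1 byte 0xE9
    print(first_invalid_utf8_offset(sample))
```

With a real file this would be called as `first_invalid_utf8_offset(open("rmlint.json", "rb"))`. A result of `None` means the whole file decoded cleanly.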
I think it has something to do with https://github.com/sahib/rmlint/blob/master/lib/formats/json.c#L73
@sahib any thoughts on best way forwards? Maybe write using json-glib instead?
Hey,
> @sahib any thoughts on best way forwards? Maybe write using json-glib instead?
Yes, I'm afraid this is the cause. That hacky util was there before we used json-glib for replay and it didn't age well.
Alternatively, one could make this function UTF-8 safe by using g_utf8_next_char and g_utf8_get_char. From what I can see that's the only place that operates on individual bytes. But honestly, using json-glib is likely the more sensible option here.
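To illustrate the idea of walking the string by code point instead of by byte, here is a rough Python analogue. It is only a sketch of the technique, not the code in json.c and not what json-glib does internally; `escape_json_string` is a hypothetical name. The point is that only `"`, `\` and control characters get escaped, while multi-byte characters pass through whole.

```python
import json


def escape_json_string(raw: bytes) -> str:
    """Return `raw` as a quoted JSON string literal.

    Hypothetical helper for illustration. Decoding first means the loop
    sees whole code points, never the individual bytes of a multi-byte
    character. Invalid UTF-8 raises UnicodeDecodeError instead of being
    written out silently.
    """
    text = raw.decode("utf-8")
    out = []
    for ch in text:
        if ch == '"':
            out.append('\\"')
        elif ch == "\\":
            out.append("\\\\")
        elif ord(ch) < 0x20:
            # JSON requires control characters to be escaped.
            out.append(f"\\u{ord(ch):04x}")
        else:
            # Everything else, including non-ASCII, is copied unchanged.
            out.append(ch)
    return '"' + "".join(out) + '"'


if __name__ == "__main__":
    literal = escape_json_string("numéros\ttéléphone.txt".encode("utf-8"))
    print(literal)
    print(json.loads(literal))
```

In C with GLib the same loop would advance with g_utf8_next_char and read each character with g_utf8_get_char, as suggested above.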
@He1my are you able to confirm if https://github.com/SeeSpotRun/rmlint/tree/glib-json fixes this?
Closing for now, can re-open if issue recurs
Hi, I have the same error, but for a different reason: my filenames are mostly UTF-8, but some of them are Latin-1 (old backups...).
LC_ALL=C rmlint .
# Duplicate(s):
ls '/tmp/test/numéros-téléphone.2002.txt'
rm '/tmp/test/num�ros-t�l�phone.txt'
When trying to replay:
LC_ALL=C rmlint --replay rmlint.json .
INFO: Loading json-results `/tmp/test/rmlint.json'
WARNING: Error: JSON data must be UTF-8 encoded
WARNING: Loading /tmp/test/rmlint.json failed.
ERROR: No valid .json files given, aborting.
I have a 1.6 GB rmlint.json that is unreadable with the current version (rev 254cf962). Do you have a solution?
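One possible workaround for a file with mixed encodings, sketched in Python under some assumptions: the invalid bytes really are Latin-1, and the JSON is line-oriented enough to stream line by line. The names `repair_line` and `repair_stream` and the error handler name `latin1_fallback` are made up for this example and are not part of rmlint. Valid UTF-8 is kept as-is and only the bytes that fail to decode are reinterpreted as Latin-1.

```python
import codecs
import io


def _latin1_fallback(err):
    """Decode error handler: treat the offending bytes as Latin-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end


# Register under a hypothetical handler name used only by this sketch.
codecs.register_error("latin1_fallback", _latin1_fallback)


def repair_line(line: bytes) -> bytes:
    """Return `line` as valid UTF-8, converting only the invalid bytes."""
    return line.decode("utf-8", errors="latin1_fallback").encode("utf-8")


def repair_stream(src, dst) -> int:
    """Copy binary stream `src` to `dst`; return the number of changed lines."""
    changed = 0
    for line in src:  # splitting on b"\n" is safe in both encodings
        fixed = repair_line(line)
        changed += fixed != line
        dst.write(fixed)
    return changed


if __name__ == "__main__":
    src = io.BytesIO(b'"path": "/tmp/test/num\xe9ros-t\xe9l\xe9phone.txt"\n')
    dst = io.BytesIO()
    print(repair_stream(src, dst), dst.getvalue().decode("utf-8"))
```

Two caveats. A Latin-1 byte pair that happens to form a valid UTF-8 sequence is left untouched, so the conversion is a best guess. Also, the repaired file should parse, but the converted paths no longer match the bytes of the names on disk, so a replay may still not find those files unless they are renamed to UTF-8 as well.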