
at least one file's utf-8 encoding is wrong, presumably more?

Open mlc opened this issue 7 years ago • 5 comments

Hi, thanks for this excellent work!

I suspect it's not an isolated incident, but don't presently have anything beyond a single anecdote:

  • 夢溪筆談 is valid UTF-8 Chinese text on the Project Gutenberg website.
  • But file 073/07317.txt in the gutenberg-dammit corpus is valid UTF-8 gibberish.
  • If you take the gutenberg-dammit file and convert it from utf-8 to "latin-1", you end up with a file which chardet says is Big5-encoded text. This appears to be mostly correct, except that there is some garbage in it, so it cannot be recoded successfully by any of the few tools I tried.
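For anyone who wants to reproduce the round trip described above, here is a minimal Python sketch. It assumes the damage came from Big5 bytes being read as Latin-1 and then saved as UTF-8, which is only a guess based on the symptoms. The function names `misencode` and `repair` are hypothetical and not part of gutenberg-dammit.

```python
def misencode(text: str, original_encoding: str) -> bytes:
    """Simulate the suspected bug: bytes in `original_encoding` are
    misread as Latin-1 and the result is written out as UTF-8."""
    return text.encode(original_encoding).decode("latin-1").encode("utf-8")


def repair(data: bytes, original_encoding: str, errors: str = "strict") -> str:
    """Undo the damage: UTF-8 -> text -> Latin-1 bytes -> original encoding."""
    return data.decode("utf-8").encode("latin-1").decode(original_encoding, errors=errors)


title = "夢溪筆談"
damaged = misencode(title, "big5")

# The damaged bytes are still valid UTF-8, they just decode to gibberish.
assert damaged.decode("utf-8") != title

# Reversing the two steps recovers the original title.
assert repair(damaged, "big5") == title
```

On the real file a strict decode may fail because of the garbage mentioned above. Passing `errors="replace"` to `repair` would substitute U+FFFD for the undecodable bytes instead of raising, which at least shows where the bad spots are.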

Anyway that's the data I have for now…

mlc avatar Aug 15 '18 13:08 mlc

oof, thanks for the data point, I'll look into it when I get a sec.

aparrish avatar Aug 16 '18 04:08 aparrish

Found another one: Τα Γεωργικά is fine on the Gutenberg website, but double-utf8-encoded in gutenberg-dammit (if you recode it from utf-8 to "latin-1" you end up with valid-looking utf8 Greek text).
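A rough Python sketch of that recoding, in case it helps with testing a fix. The helper name `fix_double_utf8` is hypothetical, and the sketch assumes the extra layer was added by decoding UTF-8 bytes as Latin-1; if the real culprit was a different single-byte codec (for example Windows-1252), the encode step would need to change.

```python
def fix_double_utf8(text: str) -> str:
    """Try to strip one extra layer of UTF-8 encoding.

    Returns the input unchanged when it does not look double-encoded,
    i.e. when it cannot be encoded as Latin-1 or the resulting bytes
    are not valid UTF-8.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return text


original = "Τα Γεωργικά"

# Simulate the double encoding: UTF-8 bytes misread as Latin-1.
doubled = original.encode("utf-8").decode("latin-1")

assert doubled != original
assert fix_double_utf8(doubled) == original

# Correctly encoded Greek text is left alone, because it cannot be
# encoded as Latin-1 in the first place.
assert fix_double_utf8(original) == original
```

This check is a heuristic: a genuine Latin-1 string that happens to form valid UTF-8 byte sequences would also be "fixed", so it is safer to apply it per file after a manual look than across the whole corpus.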

mlc avatar Sep 23 '18 18:09 mlc

hi—just to verify, are you using the latest version (002)? I do know for sure that the original version had messed up encodings, which is why I did a second release.

aparrish avatar Sep 25 '18 20:09 aparrish

Yup, I downloaded a fresh copy of the archive just now, and manually inspected the relevant files, in order to triple-check that these two problems still exist.

The bot I wrote using your corpus has posted 267 times as of now, so with two misencoded files found, that's a rate of about 0.7% — certainly not bad at all.

Thanks again!

mlc avatar Sep 26 '18 03:09 mlc

great, thank you for checking! I'll fix in the next release.

aparrish avatar Sep 26 '18 05:09 aparrish