Ubuntu IRC broken encoding, impacting generative models downstream
The Ubuntu IRC dataset appears to contain broken character encoding, which noticeably impacts generated output from models trained on The Pile in certain situations.
For example, from https://irclogs.ubuntu.com/2020/08/23/%23ubuntu.txt
This file contains ¯\_(ツ)_/¯ which should instead show as ¯\_(ツ)_/¯, if it were properly encoded.
I can't currently inspect the data directly in The Pile, because the-eye.eu and eaidata.bmk.sh are both inaccessible right now.
However, I have seen lots of garbled output from GPT-J that looks remarkably similar to this broken encoding, e.g. ¯_(ã)_/¯
It looks like this dataset could be cleaned using the ftfy Python library: https://ftfy.readthedocs.io/en/latest/
In my very brief testing, this appears to fix the broken encoding from the file linked above.
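For anyone who wants to reproduce the symptom without downloading anything, here is a minimal stdlib-only sketch of how this kind of garbling can arise and be reversed. It assumes the damage is UTF-8 bytes read as Latin-1, which is my guess rather than something confirmed for this dataset, and the helper name `undo_latin1_mojibake` is made up for illustration. ftfy handles many more cases than this single round trip.

```python
original = "¯\\_(ツ)_/¯"

# Simulate the damage: UTF-8 bytes misread as Latin-1 (assumed failure mode).
garbled = original.encode("utf-8").decode("latin-1")


def undo_latin1_mojibake(text: str) -> str:
    """Reverse one UTF-8-read-as-Latin-1 round trip.

    Hypothetical helper: returns the text unchanged when the
    round trip does not apply (e.g. the text was already correct).
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = undo_latin1_mojibake(garbled)
```

Here `garbled` contains "ã" where the katakana character used to be, and `repaired` compares equal to `original`. Text that was never damaged passes through untouched, because it either cannot be encoded as Latin-1 or does not decode as valid UTF-8.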
~Could we download them again without errors, or are they gone?~
So my guess is that this is a UTF-8-to-ASCII error. Maybe the server is messing with the encoding?
Try requesting UTF-8 when doing the GET request.
I don't believe you can specify character encoding in HTTP requests. I'll try to contact the author of the bot that scrapes for irclogs.ubuntu.com to get some insight, or report a bug (no way the data has been encoded wrong for over a decade, right?...)
Found the solution. The .txt files are mixed encoding, line-by-line.
This dataset must be properly decoded before use. This can be done fairly simply:
https://github.com/mgedmin/irclog2html/blob/ab7759e4b54f146f9c585d2c71d321fbda5c1e1c/src/irclog2html/irclog2html.py#L199-L208
https://github.com/mgedmin/irclog2html/blob/ab7759e4b54f146f9c585d2c71d321fbda5c1e1c/src/irclog2html/irclog2html.py#L141-L154
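For readers who don't want to follow the links, the general idea can be sketched as below. This is not a copy of the irclog2html code: the function names are made up, and the fallback order (UTF-8 first, then cp1252, then Latin-1) is my assumption about a reasonable approach. The key point is that the file must be read as bytes and decoded one line at a time, since different lines may use different encodings.

```python
def decode_irc_line(raw: bytes) -> str:
    """Decode one log line: try UTF-8 first, then legacy encodings.

    Hypothetical helper; the fallback order is an assumption.
    """
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1")


def decode_irc_log(data: bytes) -> str:
    """Decode a whole log, line by line, keeping line endings."""
    lines = data.splitlines(keepends=True)
    return "".join(decode_irc_line(line) for line in lines)


# Two made-up log lines in different encodings within one "file".
sample = (
    "[12:00] <a> ¯\\_(ツ)_/¯\n".encode("utf-8")
    + "[12:01] <b> café\n".encode("latin-1")
)
decoded = decode_irc_log(sample)
```

Decoding the whole file with a single encoding would either raise an error or garble one of the two lines. Decoding per line recovers both.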
@briansemrau do you know if huggingface would decode this properly? i'm not sure where i should look in https://github.com/huggingface/datasets/tree/main/src/datasets/utils
do you know if huggingface would decode this properly?
I would not expect it to. This dataset has strange encoding to work around a specific technical problem with IRC compatibility. You should use the code from the links I posted above to make sure the data is being properly decoded.
i see. thank you very much!