the-pile Ubuntu IRC broken encoding, impacting generative models downstream

The Ubuntu IRC dataset appears to contain broken character encoding, which noticeably impacts generated output from models trained on The Pile in certain situations.

For example, https://irclogs.ubuntu.com/2020/08/23/%23ubuntu.txt contains Â¯\_(ãƒ„)_/Â¯, which should appear as ¯\_(ツ)_/¯ if it were properly encoded.

I can't currently inspect the data directly in The Pile, because the-eye.eu and eaidata.bmk.sh are both inaccessible right now. However, I have seen lots of garbled output from GPT-J that looks remarkably similar to this broken encoding, e.g. Â¯_(ã)_/Â¯

It looks like this dataset could be cleaned using the ftfy Python library (https://ftfy.readthedocs.io/en/latest/). In my very brief testing, it appears to fix the broken encoding in the file linked above.
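
For illustration, the garbling in the example is consistent with UTF-8 bytes having been decoded as Windows-1252. Below is a minimal stdlib-only sketch of reversing that one specific case. It is not how ftfy works internally, and `undo_cp1252_mojibake` is a hypothetical helper name, not part of ftfy. Some byte values have no mapping in Python's `cp1252` codec, so this will not recover every garbled string; ftfy is likely the more robust choice for real cleaning.

```python
def undo_cp1252_mojibake(text: str) -> str:
    """Reverse UTF-8 text that was mistakenly decoded as Windows-1252.

    Hypothetical helper for illustration only. Returns the input unchanged
    when the round trip is not possible (e.g. the text was already correct).
    """
    try:
        # Re-create the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = "Â¯\\_(ãƒ„)_/Â¯"
print(undo_cp1252_mojibake(broken))
```

Text that is already correct, or plain ASCII, passes through unchanged, because the round trip either fails or is a no-op.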

Jan 19 '23 05:01 briansemrau

~Could we download them again without errors, or are they gone?~ My guess is that this is a UTF-8-to-ASCII error. Maybe the server is messing with the encoding? Try requesting UTF-8 when making the GET request.

Jan 19 '23 06:01 Mistobaan

I don't believe you can specify character encoding in HTTP requests. I'll try to contact the author of the bot that scrapes for irclogs.ubuntu.com to get some insight, or report a bug (no way the data has been encoded wrong for over a decade, right?...)

Jan 19 '23 07:01 briansemrau

Found the solution. The .txt files use mixed encodings that vary line by line.

This dataset must be properly decoded before use. This can be done fairly simply:

https://github.com/mgedmin/irclog2html/blob/ab7759e4b54f146f9c585d2c71d321fbda5c1e1c/src/irclog2html/irclog2html.py#L199-L208

https://github.com/mgedmin/irclog2html/blob/ab7759e4b54f146f9c585d2c71d321fbda5c1e1c/src/irclog2html/irclog2html.py#L141-L154
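
A rough sketch of what line-by-line decoding can look like follows. It is a paraphrase of the general approach, not a copy of the linked irclog2html code. The UTF-8-first order, the Windows-1252 fallback, and the names `decode_line` and `decode_log` are all assumptions made for illustration.

```python
def decode_line(raw: bytes) -> str:
    """Decode one log line: try UTF-8 first, then fall back to Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback encoding. errors="replace" turns bytes that
        # cp1252 cannot map into U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")


def decode_log(data: bytes) -> str:
    """Decode a whole log whose lines may each use a different encoding."""
    # Split on raw bytes so each line is decoded independently.
    return "".join(decode_line(line) for line in data.splitlines(keepends=True))
```

The important part is to read the file in binary mode (for example `open(path, "rb").read()`) and decode each line separately, rather than decoding the whole file with a single encoding.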

Jan 19 '23 18:01 briansemrau

@briansemrau do you know if huggingface would decode this properly? I'm not sure where to look in https://github.com/huggingface/datasets/tree/main/src/datasets/utils

Apr 10 '23 19:04 keunwoochoi

> do you know if huggingface would decode this properly?

I would not expect it to. This dataset has strange encoding to work around a specific technical problem with IRC compatibility. You should use the code from the links I posted above to make sure the data is being properly decoded.

Apr 10 '23 19:04 briansemrau

i see. thank you very much!

Apr 10 '23 19:04 keunwoochoi
