PileOfLaw
Data encoding problem: text is stringified-bytes like "b'JEANNE D\xe2...'"
The credit card agreements scraping process apparently produced binary data that is stored in text strings like "b'JEANNE D\xe2\x80\x99ARC CREDIT UNION\n...'".
Note that this is a str that contains the Python representation of binary data, not the binary data itself (JSON has no binary type, so raw bytes cannot be stored in it directly).
I noticed this in the first ten lines of the file data/train.cfpb_cc.jsonl.xz and suspect that it affects all the records in that file.
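As a possible workaround on the consumer side, here is a minimal sketch of how such a value could be turned back into readable text. It assumes the string is a valid Python bytes literal and that the underlying bytes are UTF-8 (the \xe2\x80\x99 sequence in the example is consistent with that, but I have not verified it for every record). The function name decode_stringified_bytes is my own and not part of the dataset tooling.

```python
import ast


def decode_stringified_bytes(text: str) -> str:
    """Turn a str like "b'...\\xe2\\x80\\x99...'" back into readable text.

    Hypothetical helper: assumes the value is a Python bytes literal and
    that the bytes are UTF-8 encoded. Anything else is returned unchanged.
    """
    # Only touch strings that look like a bytes literal.
    if not (text.startswith("b'") or text.startswith('b"')):
        return text
    try:
        # literal_eval parses the literal safely, without executing code.
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text
    if not isinstance(value, bytes):
        return text
    try:
        return value.decode("utf-8")  # assumption: the source bytes are UTF-8
    except UnicodeDecodeError:
        return text


# The sample from this report, written with escaped backslashes so that the
# str holds the literal characters \xe2\x80\x99 rather than the bytes.
sample = "b'JEANNE D\\xe2\\x80\\x99ARC CREDIT UNION\\n'"
decoded = decode_stringified_bytes(sample)
print(decoded)
```

This only patches the symptom when reading the data. A proper fix would presumably decode the bytes in the scraping step before the records are written out.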