
Data encoding problem: text is stringified-bytes like "b'JEANNE D\xe2...'"

Open Yaakov-Belch opened this issue 9 months ago • 0 comments

The credit card agreements scraping process apparently produced binary data that is stored in text strings like "b'JEANNE D\xe2\x80\x99ARC CREDIT UNION\n...'".

Note that this is a Python str that contains the repr of a bytes object, not binary data itself (binary data cannot be stored directly in JSON).

I noticed this in the first ten lines of the file data/train.cfpb_cc.jsonl.xz and suspect that it affects all the records in that file.
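For anyone hitting the same problem, here is a minimal sketch of one possible workaround in Python, assuming the affected strings are valid Python bytes literals and that the underlying bytes are UTF-8 (the \xe2\x80\x99 sequence in the example is the UTF-8 encoding of a right single quotation mark, which suggests this). The function name decode_stringified_bytes is hypothetical, and the sample is the string quoted above, cut off at its first newline. Which JSON field holds the text is not stated here, so the sketch works on a single string only.

```python
import ast


def decode_stringified_bytes(text: str) -> str:
    """Return decoded text if `text` is the repr of a bytes object, else `text` unchanged."""
    # Only strings that look like a bytes literal are candidates.
    if not text.startswith(("b'", 'b"')):
        return text
    try:
        # literal_eval safely parses Python literals without executing code.
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text
    if isinstance(value, bytes):
        # Assumption: the original bytes are UTF-8; undecodable bytes are replaced.
        return value.decode("utf-8", errors="replace")
    return text


# The backslashes are doubled so that the str holds the literal characters \xe2 etc.
sample = "b'JEANNE D\\xe2\\x80\\x99ARC CREDIT UNION\\n'"
print(decode_stringified_bytes(sample))
```

Strings that do not start with b' or b" are returned unchanged, so the function can be applied to every record without first checking whether it is affected.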

Yaakov-Belch, May 10 '24 10:05