content-extractor Handle non-ASCII documents and JSON outputs

Open ymollard opened this issue 7 years ago • 0 comments

I'm not sure how the original code could handle UTF-8 input files. Buffering characters as Unicode let me convert mine and produce UTF-8 JSON output (io.StringIO accepts Unicode, while cStringIO does not). This PR also fixes the indentation inconsistency introduced by the previous PR and a check for bad (malformed?) font names.
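
For illustration only, here is a minimal sketch of the buffering idea described above. It is not the code from this PR: the helper names `buffer_text` and `to_utf8_json` and the sample strings are hypothetical. It assumes text chunks are collected in an `io.StringIO` buffer and then serialized with `json.dumps(..., ensure_ascii=False)` so non-ASCII characters are written as UTF-8 rather than as `\u` escapes.

```python
import io
import json


def buffer_text(chunks):
    """Collect text chunks in a Unicode-aware in-memory buffer (hypothetical helper)."""
    buf = io.StringIO()  # accepts Unicode text, unlike Python 2's cStringIO
    for chunk in chunks:
        buf.write(chunk)
    return buf.getvalue()


def to_utf8_json(record):
    """Serialize a dict to JSON, keep non-ASCII characters, and encode as UTF-8 bytes."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True).encode("utf-8")


# Sample non-ASCII input, made up for this example.
text = buffer_text([u"Résumé ", u"naïve ", u"日本語"])
payload = to_utf8_json({"text": text})
```

The resulting `payload` is a byte string that can be written to a file opened in binary mode; decoding it as UTF-8 and parsing it with `json.loads` gives back the original text.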

Jan 25 '18 14:01 ymollard
