content-extractor
Handle non-ASCII documents and JSON outputs
I'm not sure how the original code could handle UTF-8 input files. Buffering characters as Unicode let me convert mine and produce UTF-8 JSON output (io.StringIO accepts Unicode, while cStringIO does not). This PR also fixes the indentation inconsistency introduced by the previous PR and a check for bad (malformed?) font names.
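
For reviewers, here is a minimal sketch of the idea, not the actual code in this PR: text is accumulated in an `io.StringIO` buffer as Unicode and then serialized to JSON with non-ASCII characters left intact. The helper names `buffer_text` and `to_utf8_json` and the sample strings are hypothetical, chosen only for illustration.

```python
import io
import json


def buffer_text(chunks):
    """Accumulate text chunks in a Unicode buffer (hypothetical helper)."""
    buf = io.StringIO()
    for chunk in chunks:
        # Decode bytes up front so the buffer only ever holds Unicode text.
        if isinstance(chunk, bytes):
            chunk = chunk.decode("utf-8")
        buf.write(chunk)
    return buf.getvalue()


def to_utf8_json(record):
    """Serialize a record to UTF-8 encoded JSON bytes (hypothetical helper)."""
    # ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes.
    text = json.dumps(record, ensure_ascii=False, indent=2, sort_keys=True)
    return text.encode("utf-8")


# Mixed input: one UTF-8 byte string and one Unicode string.
text = buffer_text([u"Caf\u00e9 ".encode("utf-8"), u"\u65e5\u672c\u8a9e"])
payload = to_utf8_json({"text": text})
```

Decoding at the boundary and encoding only on output keeps every intermediate value in Unicode, which is what avoids the mixed bytes/Unicode errors that a byte-oriented buffer would raise.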