Converting simple English CIDs to ASCII characters

Open tosiecki opened this issue 6 years ago • 1 comments

Hi euske,

Nice module, love it. I looked through other people's posts about CIDs, but they didn't help much. I have a plain English PDF and the conversion looks good; I can even figure out by hand which English characters certain CIDs map to. But my attempts to convert every CID are failing and I'm stuck. I ran the make cmap step and it downloaded a bunch of files, but when I rerun pdf2txt.py, I still get a bunch of CIDs. The code must have this mapping somewhere in order to produce them. How do I simply convert the CIDs in my file to English characters?

Tom.

Nov 08 '19 20:11 tosiecki

First, can you try it with the latest version? Its installation is more automatic now, so the cmap step is much harder to get wrong.

Second, even if you did everything correctly, some PDFs may still end up with "unresolved" CIDs, because not all text is expressed in the same manner. Think of PDF as more of a graphics format than a text format. Something that looks like text may actually be represented as a bunch of pictograms with no mapping to real characters provided. In that case, you'll still get seemingly random CID numbers.
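
If the PDF really carries no character mapping, one possible fallback is to post-process the extracted text with a table you build by hand. The following is only a minimal sketch: it assumes the unresolved glyphs show up in the output as tokens of the form (cid:NN), and both the replace_cids helper and the example mapping are hypothetical, not part of pdfminer. The CID numbers in the table are made up; you would fill in the ones you worked out for your own file.

```python
import re

# Assumed token format for unresolved glyphs in the extracted text: "(cid:123)"
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, cid_map):
    """Replace every (cid:N) token that has an entry in cid_map.

    Tokens without an entry are left untouched so they stay visible.
    """
    def substitute(match):
        cid = int(match.group(1))
        return cid_map.get(cid, match.group(0))

    return CID_PATTERN.sub(substitute, text)


# Hypothetical hand-built mapping; real values depend on the fonts in your PDF.
example_map = {43: "H", 72: "e", 79: "l", 82: "o"}

sample = "(cid:43)(cid:72)(cid:79)(cid:79)(cid:82) (cid:999)"
print(replace_cids(sample, example_map))
```

Note that CID numbers are specific to each font, so a table built this way is only valid for the document (and font) it was derived from.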

Nov 09 '19 03:11 euske