python-for-data-and-media-communication-gitbook

Failing to read PDFs containing Chinese text (decoding error)

Open ChicoXYC opened this issue 6 years ago • 9 comments

Target

We want to extract all the text from thousands of PDFs.

Problem

Decoding problem: PDFs containing Chinese text cannot be read.

Below are the attempts we have made so far; each one hits an encoding error. See this notebook for details: http://nbviewer.jupyter.org/github/ChicoXYC/exercise/blob/master/get-text-from-pdf/read-chinese-pdf-with-encoding-error.ipynb

The PDFs can be found here: https://github.com/ChicoXYC/exercise/tree/master/get-text-from-pdf/pdfs

Dec 13 '18 07:12 ChicoXYC

@hupili Could you please help us?

Dec 13 '18 07:12 lullabymia

You may try other tools such as "pandoc".

Dec 14 '18 03:12 hupili

Does pandoc accept PDF input? The input formats it lists are commonmark, creole, docbook, docx, epub, fb2, gfm, haddock, html, jats, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki (no pdf). @hupili https://nbviewer.jupyter.org/github/lullabymia/example/blob/master/Untitled.ipynb

Dec 14 '18 10:12 lullabymia

How about the following two resources?

https://www.jianshu.com/p/31939ee6f1c9
https://github.com/wshuyi/demo-pdf-content-extract-batch-python-pdfminer/blob/master/pdf_extractor.py

Dec 14 '18 16:12 hupili

I've already tried this method before and tested it again today. It didn't work.

from pdfminer.pdfpage import PDFPage

error:

 No module named 'pdfminer.pdfpage'

and

TypeError: __init__() got an unexpected keyword argument 'codec'
# didn't find the solution

Dec 15 '18 14:12 ChicoXYC
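
The two errors above may point to a different pdfminer distribution being installed than the one the linked script was written for. This is an assumption, not something confirmed in the thread: the `pdfminer.six` distribution ships `pdfminer.pdfpage`, while some older Python 3 ports do not. A minimal sketch for checking which submodules are importable in the current environment (the helper names `has_module` and `describe_pdfminer` are hypothetical):

```python
import importlib.util


def has_module(name):
    """Return True if `name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when the parent package (here, `pdfminer`) is missing or broken.
        return False


def describe_pdfminer():
    """Report which pdfminer submodules are importable."""
    names = ("pdfminer", "pdfminer.pdfpage", "pdfminer.high_level")
    return {name: has_module(name) for name in names}


if __name__ == "__main__":
    for module, present in describe_pdfminer().items():
        print(module, "found" if present else "missing")
```

If `pdfminer` is found but `pdfminer.pdfpage` is missing, reinstalling with `pip install pdfminer.six` (after uninstalling the other distribution) is one thing worth trying.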

how about importing pdfminer alone?

Dec 15 '18 14:12 hupili

Yes, I tried this module alone. Same problem as above. It didn't work for Chinese.

Dec 15 '18 14:12 ChicoXYC

do you mean pdfminer works for English?

Dec 15 '18 15:12 hupili

No, English doesn't work either. It seems the module's functions have changed, and I haven't found the answer. But PyPDF2 works for English.

Dec 17 '18 02:12 ChicoXYC
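
For the stated goal of extracting text from thousands of PDFs, a batch loop might look like the sketch below. It assumes `pdfminer.six` is installed and uses its `pdfminer.high_level.extract_text` function; the names `pdfminer_extract` and `extract_folder` and the folder layout are hypothetical. Output is written explicitly as UTF-8 so that Chinese text does not depend on the platform's default encoding, and files that fail are collected rather than stopping the run. Whether this resolves the specific PDFs in this issue is untested.

```python
from pathlib import Path


def pdfminer_extract(pdf_path):
    """Extract text with pdfminer.six (assumed installed: pip install pdfminer.six)."""
    from pdfminer.high_level import extract_text  # imported lazily

    return extract_text(str(pdf_path))


def extract_folder(pdf_dir, out_dir, extractor=pdfminer_extract):
    """Write one UTF-8 .txt file per PDF; return a list of (filename, error) failures."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # keep going; one bad PDF should not stop the batch
            failed.append((pdf_path.name, repr(exc)))
            continue
        out_file = out_dir / (pdf_path.stem + ".txt")
        out_file.write_text(text, encoding="utf-8")
    return failed
```

Calling `extract_folder("pdfs", "texts")` would process every `*.pdf` in `pdfs/`. The `extractor` parameter lets another backend (for example, one built on PyPDF2) be swapped in without changing the loop.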
