python-for-data-and-media-communication-gitbook
Failing to read PDFs containing Chinese text (decoding error)
Target
We want to extract all the text from thousands of PDFs.
Problem
Decoding problem: we cannot read PDFs that contain Chinese text.
Our attempts so far are in the notebook below, but they run into an encoding error. See it for details: http://nbviewer.jupyter.org/github/ChicoXYC/exercise/blob/master/get-text-from-pdf/read-chinese-pdf-with-encoding-error.ipynb
The PDFs can be found here: https://github.com/ChicoXYC/exercise/tree/master/get-text-from-pdf/pdfs
@hupili Could you please help us?
You could try other tools, such as pandoc.
pandoc seems to accept input only in the following formats: commonmark, creole, docbook, docx, epub, fb2, gfm, haddock, html, jats, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki (no PDF).
@hupili
https://nbviewer.jupyter.org/github/lullabymia/example/blob/master/Untitled.ipynb
How about the following two resources?
- https://www.jianshu.com/p/31939ee6f1c9
- https://github.com/wshuyi/demo-pdf-content-extract-batch-python-pdfminer/blob/master/pdf_extractor.py
I've already tried this method before and tested it again today; it didn't work.
from pdfminer.pdfpage import PDFPage
error:
No module named 'pdfminer.pdfpage'
and
TypeError: __init__() got an unexpected keyword argument 'codec'
I haven't found a solution yet.
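One possible explanation for the two errors above, not confirmed in this thread: more than one distribution installs a package called pdfminer, and their module layouts and constructor arguments differ, so a script written for one may fail on another. The following is a minimal sketch for checking which layout is installed. The helper names `has_module` and `describe_pdfminer_install` are made up for illustration.

```python
import importlib.util


def has_module(dotted_name: str) -> bool:
    """Return True if the module can be located on the current Python path."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. `pdfminer`) is missing entirely.
        return False


def describe_pdfminer_install() -> str:
    """Report whether the installed pdfminer provides the `pdfpage` submodule."""
    if not has_module("pdfminer"):
        return "pdfminer is not installed"
    if has_module("pdfminer.pdfpage"):
        return "pdfminer.pdfpage is available"
    # Assumption: a different fork or version of pdfminer is installed.
    return "pdfminer is installed, but pdfminer.pdfpage is missing"


print(describe_pdfminer_install())
```

If the last message is printed, it may be worth checking with `pip list` which pdfminer distribution is actually installed before debugging the script itself.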
How about importing pdfminer alone?
Yes, I tried this module on its own. Same problem as above; it didn't work for Chinese.
Do you mean pdfminer works for English?
No, English doesn't work either. It seems the module's functions have changed, and I haven't found the answer. But PyPDF2 works for English.
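For the batch goal stated at the top, here is a minimal sketch of a loop over a folder of PDFs that writes each result as UTF-8 text and records failures instead of stopping. It assumes the pdfminer.six distribution, whose `pdfminer.high_level.extract_text` function takes a file path; whether it decodes the Chinese PDFs in this thread correctly has not been tested. The names `extract_with_pdfminer_six` and `extract_folder` are made up for illustration, and the extractor is a parameter so that PyPDF2 or another tool can be swapped in.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_with_pdfminer_six(pdf_path: Path) -> str:
    # Imported lazily so the rest of the sketch runs without the package.
    # Assumption: the pdfminer.six distribution is installed.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def extract_folder(
    pdf_dir,
    out_dir,
    extractor: Callable[[Path], str] = extract_with_pdfminer_six,
) -> Dict[str, str]:
    """Extract text from every *.pdf in pdf_dir into out_dir/<name>.txt.

    Returns a dict mapping the file names that failed to the error raised.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failures: Dict[str, str] = {}
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        try:
            text = extractor(pdf_path)
        except Exception as exc:
            # Keep going: with thousands of files, a few bad PDFs are likely.
            failures[pdf_path.name] = repr(exc)
            continue
        # Write explicitly as UTF-8 so Chinese characters survive on any OS.
        out_path = out_dir / (pdf_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
    return failures


# Example call (the paths are placeholders):
# failures = extract_folder("pdfs", "txt")
```

Writing the output with an explicit `encoding="utf-8"` avoids depending on the platform's default encoding, which is a common source of errors with Chinese text that is separate from the extraction step itself.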