Can't read Hebrew PDF

Open ShayHa opened this issue 5 years ago • 3 comments

Describe the bug

I am trying to parse a PDF written in Hebrew, but all I get is a lot of "cid:###" entries.

Steps to reproduce the bug

Steps used to install Camelot: I installed it from the conda prompt:

conda install -c conda-forge camelot-py

Expected behavior

Extract the words from the PDF.

import camelot
tables = camelot.read_pdf(r"file.pdf")

PDF

Unfortunately, I cannot share my PDF, but it has a table and Hebrew words.

Environment

  • OS: [e.g. MacOS]
  • Python version: 3.7.4
  • Numpy version: 1.18.1
  • OpenCV version: 4.1.2.30
  • Ghostscript version: not sure how to check
  • Camelot version: 0.8.2

ShayHa avatar Nov 30 '20 10:11 ShayHa

Please read this (#199).

It seems that your PDF is missing character mappings. It's not a Camelot bug, I think.
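A quick way to confirm that diagnosis is to measure how much of the extracted text consists of cid placeholders. The sketch below is only an illustration: it assumes the placeholders look like "(cid:123)", which is how pdfminer-based extractors usually render glyphs without a Unicode mapping, and the helper names `cid_ratio` and `looks_unmapped` and the 0.5 threshold are hypothetical, not part of Camelot.

```python
import re

# Assumed placeholder shape: "(cid:<digits>)", as pdfminer-style extractors
# typically emit for glyphs that have no Unicode mapping in the PDF.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Fraction of the non-whitespace characters taken up by cid placeholders."""
    stripped = "".join(text.split())
    if not stripped:
        return 0.0
    cid_chars = sum(len(match) for match in CID_PATTERN.findall(stripped))
    return cid_chars / len(stripped)


def looks_unmapped(text: str, threshold: float = 0.5) -> bool:
    """Heuristic (hypothetical threshold): mostly placeholders means no usable text layer."""
    return cid_ratio(text) >= threshold
```

If most cells of an extracted table (for example, the strings in `tables[0].df`) score close to 1.0, the PDF's fonts most likely lack the character mappings and no extraction setting will recover the text.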

anakin87 avatar Dec 01 '20 11:12 anakin87

Yes, it's not a Camelot bug according to the other issue and the Stack Overflow link. I guess there is no workaround that does not involve OCR? I managed to extract the text using Tesseract, but it ruins the tables and gives me text only.

ShayHa avatar Dec 01 '20 11:12 ShayHa

Sorry, I never found any solution other than OCR for this problem.
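If OCR is the only route, some table structure can sometimes be recovered from the word positions that OCR engines report alongside the text. The following is a minimal sketch under stated assumptions, not a tested solution for this PDF: the `(text, left, top)` tuple shape, the `group_into_rows` helper, and the 5-pixel tolerance are all hypothetical, and the word boxes are assumed to have been obtained separately from the OCR tool.

```python
from typing import List, Tuple

# Hypothetical input shape: one (text, left, top) tuple per OCR word,
# with coordinates in pixels measured from the top-left of the page image.
Word = Tuple[str, float, float]


def group_into_rows(words: List[Word], y_tolerance: float = 5.0) -> List[List[str]]:
    """Group words into rows by vertical position, then order each row by x."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # A word joins the current row if its top is close to the row's first word.
        if rows and abs(word[2] - rows[-1][0][2]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Sort each row left to right and keep only the text.
    return [[w[0] for w in sorted(row, key=lambda w: w[1])] for row in rows]
```

For a right-to-left table such as a Hebrew one, each row may need to be reversed after grouping, and real scans usually need a tolerance tuned to the font size.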

anakin87 avatar Dec 01 '20 11:12 anakin87