doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

Results 15 doc2text issues

Tesseract seems to be able to produce PDFs these days with text overlaid on the image. This is useful for searching in the PDF when viewing it that way. It'd be...
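For context, the text-over-image PDF output mentioned in this issue is available from the Tesseract command line by passing `pdf` as the output config. The following is only a minimal sketch of calling it from Python, assuming the `tesseract` binary is on the PATH; the helper names `searchable_pdf_command` and `make_searchable_pdf` are hypothetical and are not part of doc2text.

```python
import subprocess


def searchable_pdf_command(image_path, output_base):
    """Build the Tesseract CLI call that writes <output_base>.pdf.

    The trailing "pdf" argument selects Tesseract's PDF output config,
    which overlays the recognized text on the page image.
    """
    return ["tesseract", image_path, output_base, "pdf"]


def make_searchable_pdf(image_path, output_base):
    """Run Tesseract and return the path of the PDF it writes.

    Assumes the tesseract binary is installed and on the PATH.
    """
    subprocess.run(searchable_pdf_command(image_path, output_base), check=True)
    return output_base + ".pdf"
```

A call such as `make_searchable_pdf("scan.png", "scan")` would be expected to write `scan.pdf` next to the working directory, provided Tesseract is installed.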

On calling the process_image() method, the image to be processed is not cropped accurately (attached below). Which (and how) of the calls in the method will I need to modify...

I have installed doc2text and the required packages, but when I try to import doc2text I get an error saying there is no module named PythonMagick. ![2021-07-21](https://user-images.githubusercontent.com/33904670/126509150-05030fcd-1f07-43b8-9165-7c40716a4802.png)

I have a Flask app that receives a file from the API, and I want to extract the text from it, but I don't want to save it on...

The library does not seem to be 100% Python 3 compatible. When I try to run this simple code:

```
import doc2text
doc = doc2text.Document()
doc = doc2text.Document(lang="eng")
doc.read('pdf-sample.pdf')
```

I'm getting

```
Traceback (most recent...
```

Adds an `xrange` replacement for Python 3
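As an illustration of what such a replacement usually looks like, here is a common compatibility shim. It is a sketch of the general technique, not necessarily the change made in this pull request.

```python
# Python 2 has a lazy xrange(); Python 3 removed it because range() is
# already lazy. Falling back to range keeps old call sites working.
try:
    xrange  # defined on Python 2 only
except NameError:
    xrange = range  # Python 3: alias the built-in range

# Existing loops written against xrange now run on both versions.
total = sum(i for i in xrange(5))
```

Placing the shim once near the top of a module avoids touching every loop that uses `xrange`.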

```
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    import doc2text
  File "/Users/Stan/Downloads/doc2text-master/doc2text/__init__.py", line 6, in <module>
    import PyPDF2 as pyPdf
ModuleNotFoundError: No module named 'PyPDF2'
```
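Several of these reports come down to a dependency that is not importable. A quick way to check before importing doc2text is sketched below; the helper name `missing_modules` is hypothetical, and the module names passed to it are simply the ones mentioned in the issues above, not a confirmed list of doc2text's requirements.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the error reports above (assumed, not exhaustive).
print(missing_modules(["PyPDF2", "PythonMagick"]))
```

An empty list means every named module can be located by the current interpreter; any name that is printed still needs to be installed in that environment.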

When I try to call doc.extract_text(), it gives a "file not found" error. I'm using Windows 10, Python 3.6, and Jupyter.

Thank you for this fantastic utility. Text extraction is not successful for any PNG image containing text. JPG and PDF files work. Is this a known issue and will there...

Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25 dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211 dst is not a numpy array, neither a scalar...