OCR_ENGINE=None Doesn't work

Open svmrw opened this issue 1 year ago • 5 comments

Hello. The Readme says the following:

By default, marker will use surya for OCR. Surya is slower on CPU, but more accurate than tesseract. If you want faster OCR, set OCR_ENGINE to ocrmypdf. This also requires external dependencies (see above). If you don't want OCR at all, set OCR_ENGINE to None.

export OCR_ENGINE=None
marker_single ./file.pdf ./marker

Running the command gives the following:

pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
OCR_ENGINE
  Input should be 'surya' or 'ocrmypdf' [type=literal_error, input_value='None', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/literal_error
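
The error above shows `input_value='None', input_type=str`: an environment variable is always a string, so `export OCR_ENGINE=None` passes the text "None" rather than Python's `None`, and the literal check accepts only 'surya' or 'ocrmypdf'. Below is a minimal standard-library sketch of the kind of conversion that would be needed before validation. `parse_ocr_engine` and `ALLOWED_ENGINES` are hypothetical names for illustration, not marker's actual code, and the default of "surya" is taken from the README text quoted above.

```python
from typing import Optional

# The two values named in the validation error message.
ALLOWED_ENGINES = ("surya", "ocrmypdf")


def parse_ocr_engine(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: turn the raw OCR_ENGINE value into an engine name or None."""
    if raw is None:
        # Variable not set at all: fall back to the documented default.
        return "surya"
    value = raw.strip()
    if value.lower() == "none":
        # The string "None" from the shell means "no OCR".
        return None
    if value in ALLOWED_ENGINES:
        return value
    raise ValueError(f"OCR_ENGINE should be one of {ALLOWED_ENGINES} or None, got {raw!r}")
```

With this, `parse_ocr_engine(os.environ.get("OCR_ENGINE"))` would yield `None` for `export OCR_ENGINE=None`. Whether the rest of the pipeline then actually skips OCR is a separate question.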

I want to convert PDF to Markdown without using OCR. Almost all PDF files have text that can be selected and copied, and embedded images need to be kept as they are. It seems to me that the whole document does not need to be recognized as an image when the text is easy to copy.

Is this possible? Maybe it was supported before but no longer is? Or maybe I am doing something wrong? Thanks.

svmrw avatar Aug 16 '24 19:08 svmrw

#257 I tried applying the changes from your commit manually. The error is no longer displayed, but Surya OCR still loads and recognizes the whole file. That is, OCR_ENGINE=None and OCR_ENGINE=Surya behave the same; no change is visible. I am most likely doing something wrong, so please check it yourself.

svmrw avatar Aug 18 '24 13:08 svmrw

I'm running into the same problem, and since OCR pushes my machine to its memory limit, I now need to use different software. Dead end.

kyr0 avatar Sep 17 '24 22:09 kyr0

The problem is still relevant. Changes from here did not help at all either.

Personally, I don't care about performance. The thing is that OCR recognition spoils embedded images. So I would like OCR_ENGINE=None to work.

svmrw avatar Oct 06 '24 20:10 svmrw

I'm also wondering if I can disable the use of OCR.

The PDF I'm converting is a tutorial with screenshots of a desktop interface. It looks like, for some images, the OCR treats them as tables inside the PDF and replaces the screenshots with (incorrect) markdown tables.

I need to ensure that no images inside the PDF are accidentally converted to markdown.

Perhaps my issue is with https://github.com/VikParuchuri/marker/blob/228a7ba9a91e4ff24654e640484d3598b8c16f9a/marker/processors/table.py

I'm not exactly sure how to disable table processing.

KamilDev avatar Jan 28 '25 05:01 KamilDev

Have you fixed this? I ran into the same problem.

erickech avatar Apr 11 '25 02:04 erickech

I encountered the same problem as @KamilDev .

I changed the code as follows and it worked fine.

        # Nested helper: relabel table/form regions as figures so they are kept as images
        def retain_images(layout: LayoutResult) -> LayoutResult:
            for b in layout.bboxes:
                if b.label in ["Table", "Form"]:
                    b.label = "Figure"
            return layout

        layout_results_img = [
            retain_images(l) for l in layout_results
        ]
        return layout_results_img

https://github.com/VikParuchuri/marker/blob/228a7ba9a91e4ff24654e640484d3598b8c16f9a/marker/builders/layout.py#L101
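
To see the relabeling idea in isolation, here is a self-contained sketch that can be run without marker installed. `Box` and `Layout` are hypothetical stand-ins for the real layout classes, assumed only to expose a `label` attribute and a `bboxes` list as in the snippet above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Hypothetical stand-in for one detected layout region."""
    label: str


@dataclass
class Layout:
    """Hypothetical stand-in for LayoutResult: the regions found on one page."""
    bboxes: List[Box] = field(default_factory=list)


def retain_images(layout: Layout) -> Layout:
    # Relabel regions detected as tables or forms so that later steps
    # treat them as figures instead of converting them to markdown tables.
    for b in layout.bboxes:
        if b.label in ("Table", "Form"):
            b.label = "Figure"
    return layout


page = Layout([Box("Text"), Box("Table"), Box("Form"), Box("Figure")])
labels = [b.label for b in retain_images(page).bboxes]
print(labels)  # ['Text', 'Figure', 'Figure', 'Figure']
```

Note the trade-off: this relabels every table and form region, so genuine tables in the PDF would also be kept as images rather than converted.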

nshun avatar Jun 06 '25 11:06 nshun