OCR_ENGINE=None Doesn't work
Hello. The Readme says the following:
By default, marker will use surya for OCR. Surya is slower on CPU, but more accurate than tesseract. If you want faster OCR, set OCR_ENGINE to ocrmypdf. This also requires external dependencies (see above). If you don't want OCR at all, set OCR_ENGINE to None.
export OCR_ENGINE=None
marker_single ./file.pdf ./marker
Running the command gives the following:
pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
OCR_ENGINE
Input should be 'surya' or 'ocrmypdf' [type=literal_error, input_value='None', input_type=str]
For further information visit https://errors.pydantic.dev/2.8/v/literal_error
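The message shows input_type=str: environment variables are always strings, so the setting receives the text 'None', which is not one of the two allowed literals. A minimal sketch of parsing that would tolerate it, using only the standard library. The names parse_ocr_engine and OcrEngine are hypothetical and not part of marker; the allowed values are taken from the error message above.

```python
import os
from typing import Literal, Optional, get_args

# Assumption: the allowed values mirror the error message above.
OcrEngine = Literal["surya", "ocrmypdf"]


def parse_ocr_engine(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: map an environment string to an engine name or None."""
    # "None" arrives as a plain string, so it has to be mapped explicitly.
    if raw is None or raw.strip().lower() in {"", "none"}:
        return None
    value = raw.strip()
    if value not in get_args(OcrEngine):
        raise ValueError(
            f"OCR_ENGINE should be one of {get_args(OcrEngine)} or None, got {raw!r}"
        )
    return value


if __name__ == "__main__":
    # Falls back to "surya" when the variable is unset, matching the README default.
    print(parse_ocr_engine(os.environ.get("OCR_ENGINE", "surya")))
```

Accepting the value is only half of the problem: the code that decides whether to run OCR would also have to check for None.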
I want to convert PDFs to markdown without using OCR. Almost all of my PDF files have text that can be selected and copied, and the embedded images need to be kept as they are. It seems to me that a document does not need to be recognized as an image when its text is already easy to copy.
Is this possible? Was it supported before but no longer is, or am I doing something wrong? Thanks.
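To illustrate the idea of skipping OCR when a page already has a usable text layer, here is a rough sketch. It is not how marker decides this; page_needs_ocr and the min_chars threshold are hypothetical, and the text for each page is assumed to come from whatever PDF text extractor is in use.

```python
def page_needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: OCR a page only if its text layer is nearly empty."""
    # Ignore whitespace so a page of blank lines does not count as text.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


# Example input: one page with selectable text, one with only whitespace.
pages = ["Chapter 1: Installing the desktop client ...", "   \n  "]
print([page_needs_ocr(text) for text in pages])
```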
#257 I tried to apply the changes from your commit manually. The error is no longer displayed, but Surya OCR still loads and recognizes the whole file. That is, OCR_ENGINE=None and OCR_ENGINE=surya behave the same, with no visible difference. I may well be doing something wrong, so could you please check it yourself?
I'm running into the same problem. Because OCR pushes my machine to its memory limit, I have to use different software for now. Dead end.
The problem is still relevant. Changes from here did not help at all either.
Personally, I don't care about performance. The thing is that OCR recognition spoils embedded images. So I would like OCR_ENGINE=None to work.
I'm also wondering if I can disable the use of OCR.
The PDF I'm converting is a tutorial with screenshots of a desktop interface. For some images, the OCR seems to treat them as tables and replaces the screenshots with (incorrect) markdown tables.
I need to ensure that images inside the PDF are never accidentally converted to markdown.
Perhaps my issue is with https://github.com/VikParuchuri/marker/blob/228a7ba9a91e4ff24654e640484d3598b8c16f9a/marker/processors/table.py
I'm not sure how to disable table processing.
Has this been fixed? I ran into the same problem.
I encountered the same problem as @KamilDev.
I changed the code as follows and it worked fine.
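    # Helper added to the layout builder class: relabel table and form regions
    # as figures so they are kept as images instead of converted to markdown.
    def retain_images(self, layout: LayoutResult):
        for b in layout.bboxes:
            if b.label in ["Table", "Form"]:
                b.label = "Figure"
        return layout

    # At the end of the method that builds layout_results (see the link below).
    # The helper takes self, so it is called as a method here.
    layout_results_img = [
        self.retain_images(result) for result in layout_results
    ]
    return layout_results_img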
https://github.com/VikParuchuri/marker/blob/228a7ba9a91e4ff24654e640484d3598b8c16f9a/marker/builders/layout.py#L101
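For anyone who wants to try the relabeling idea outside of marker, here is a self-contained sketch of the same logic. Box and Layout are hypothetical stand-ins for the real layout types, which are assumed here to expose only bboxes and label as in the snippet above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Stand-in for a layout box; only the label matters here."""
    label: str


@dataclass
class Layout:
    """Stand-in for LayoutResult, holding the detected boxes of one page."""
    bboxes: List[Box] = field(default_factory=list)


def retain_images(layout: Layout) -> Layout:
    # Relabel in place so table and form regions are treated as figures.
    for box in layout.bboxes:
        if box.label in ("Table", "Form"):
            box.label = "Figure"
    return layout


layout_results = [
    Layout([Box("Text"), Box("Table")]),
    Layout([Box("Form"), Box("Figure")]),
]
relabelled = [retain_images(result) for result in layout_results]
print([[box.label for box in result.bboxes] for result in relabelled])
# prints [['Text', 'Figure'], ['Figure', 'Figure']]
```

Note that this turns every detected table into an image, including real tables, so it only suits documents where table extraction is not wanted at all.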