jbarlow

352 comments by jbarlow

It seems that Ghostscript's PDF/A conversion removes links, even with `-dPrinted=false` which as [explained here](https://bugs.ghostscript.com/show_bug.cgi?id=699830#c16) should prevent links from being deleted. I suppose my answer needs to be that, if...

Agreed this is nice to have. The difficulty is there's a plethora of possible barcode formats, locations to search for them, and actions to take based on them being recognized...

Kind of? This ambiguity is not intended, but it looks to me like it's going to end up doing the right thing anyway. If the first `if` is executed, then...

I think the existing behavior is the correct behavior. `--skip-text` means processing on pages that have text is skipped. The intended use case is a PDF that contains a mixture...
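
As a rough illustration of the rule described above (pages that already have text are skipped, the rest are processed), here is a toy model. The `plan_pages` function and its input format are hypothetical and are not part of ocrmypdf; it only mirrors the stated behavior of `--skip-text`.

```python
def plan_pages(page_has_text):
    """Toy model of the --skip-text rule described above.

    page_has_text: list of booleans, one per page (hypothetical input).
    Returns "skip" for pages that already contain text and "ocr" for the rest.
    This is an illustration only, not ocrmypdf's implementation.
    """
    return ["skip" if has_text else "ocr" for has_text in page_has_text]


# A mixed PDF: page 1 is born-digital, page 2 is a scan, page 3 is born-digital.
plan = plan_pages([True, False, True])
print(plan)
```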

ocrmypdf has three [PDF renderers](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#changing-the-pdf-renderer). One of them is called the `hocr` renderer and uses hOCR as an intermediate format. For your use case it might make the most sense...
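
A minimal sketch of selecting that renderer from a script, assuming the `--pdf-renderer` command-line option described on the linked documentation page. The helper name `build_ocrmypdf_command` and the file names are made up for illustration; the snippet only builds the argument list and does not run ocrmypdf.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, renderer="hocr"):
    """Build an ocrmypdf argument list that selects a PDF renderer.

    The helper and the file names are hypothetical; the --pdf-renderer
    option is taken from the ocrmypdf documentation linked above.
    """
    return ["ocrmypdf", "--pdf-renderer", renderer, input_pdf, output_pdf]


cmd = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf")
# Print a shell-safe form of the command for inspection.
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where ocrmypdf is installed.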

Ooh, interesting. Thanks for doing this investigation. Can you share the PDF? This is very likely going to have to do with details of how the previous PDF is formatted....

Re the workaround: ~~There are multiple programs called "pdftoimages" based on my search.~~ You do get better quality by extracting images from PDFs and applying OCR to those, but that...

Incorrect spacing between letters/words is a longstanding problem in Tesseract and I've worked on it over there. It has to do with PDF being a print production file format, with...

I don't quite understand what you're trying to do. Other than ALTO, what inputs do you have (do you have the original PDFs the ALTO was derived from), and what...

Okay, now I see what you're trying to do. There isn't any way to do that without some new feature development. You could write a fake OCR engine plugin that does...