jbarlow

Results: 380 comments by jbarlow

That looks really interesting. This is the sort of thing I'd like to use the forthcoming plugin architecture for, if/when I get a chance to finish it.

Yes, you could write a plugin that hooks `filter_page_image` for example.
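As a rough sketch of what such a plugin module might look like: the hook name `filter_page_image` comes from the comment above, and `hookimpl` is the marker ocrmypdf exports for plugins. The pass-through body and the import fallback are assumptions made only so the example runs on its own.

```python
from pathlib import Path

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Assumption for illustration: a no-op stand-in so this sketch runs
    # without ocrmypdf installed. A real plugin would import unconditionally.
    def hookimpl(func):
        return func


@hookimpl
def filter_page_image(page, image_filename: Path) -> Path:
    """Called with the rasterized page image; returns the image to use.

    This placeholder returns the file unchanged. A real filter would write
    a modified image next to it and return that new path instead.
    """
    return image_filename
```

If saved as a hypothetical `my_plugin.py`, it could presumably be loaded with `ocrmypdf --plugin my_plugin.py input.pdf output.pdf`.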

I think this would be great, but there's a lot to do to make it work, especially to support after-the-fact editing.

ocrmypdf already has the ability to merge hOCR HTML into PDF through its public APIs. What it does not have is a convenient way to run its post-processing on a...

@tukusejssirs The relevant code is in hocrtransform.py. See `python -m ocrmypdf.hocrtransform --help`.

No, it doesn't have that ability, but you could split the hOCR and run a loop.
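A minimal sketch of the "split the hOCR" step, using only the standard library. It assumes the pages are `class="ocr_page"` elements directly under `<body>`, as in Tesseract's output; `split_hocr_pages` is a hypothetical helper, not part of ocrmypdf. Note that ElementTree does not preserve the DOCTYPE when serializing.

```python
import copy
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)


def _local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def split_hocr_pages(hocr_text):
    """Return one hOCR document string per ocr_page element."""
    root = ET.fromstring(hocr_text)
    body = next(el for el in root if _local(el.tag) == "body")
    pages = [el for el in body if el.get("class") == "ocr_page"]

    documents = []
    for page in pages:
        # Copy the whole document, then keep only this one page in <body>.
        doc = copy.deepcopy(root)
        doc_body = next(el for el in doc if _local(el.tag) == "body")
        for child in list(doc_body):
            if child.get("class") == "ocr_page":
                doc_body.remove(child)
        doc_body.append(copy.deepcopy(page))
        documents.append(ET.tostring(doc, encoding="unicode"))
    return documents
```

Each returned string can then be written to its own file and processed in a loop, one page at a time.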

It looks like the XML (`024_hocr.html`) is invalid, specifically at line 45.
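One way to locate this kind of problem yourself is to run the file through a strict XML parser and read off the reported position. This is a sketch using Python's standard library, not the check ocrmypdf itself performs; `find_xml_error` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def find_xml_error(text):
    """Return None if text is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        # ParseError.position is a (line, column) tuple.
        line, column = exc.position
        return line, column, str(exc)
    return None
```

For a file on disk, read it first (for example with `Path("024_hocr.html").read_text()`) and pass the contents in. An undefined entity such as `&nbsp;` is reported as an error in the same way as a mismatched tag.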

ocrmypdf.hocrtransform is only capable of parsing the subset of hOCR generated by Tesseract. For this specific case, you'll need to add a string like the following to the top of...

(Note that this doctype signature may actually be incorrect for hOCR; use whatever the hOCR spec says is correct.)

Official definition is

```xml
```

From: https://www.w3.org/TR/html4/sgml/entities.html