Support different formats in method export_to_hocr

Open smanero opened this issue 2 years ago • 0 comments

Hi,

I have found a backward incompatibility introduced by the change included in the 2a0 version. When using the export_hocr_str() method in wrapper.document.py, the hOCR output format differs from previous versions.

That fix changes how the "ocr_line" class is embedded in the hOCR output: each line now includes the "ocrx_word" elements that make it up. This makes the output more logical and closer to the spec; however, all processors trained on the previous hOCR format are now malfunctioning.

I would like to propose a solution where the method can also export the hOCR in the previous format, keeping the current format as the default. Something like this:

# Requires: from jinja2 import Environment, PackageLoader
def export_hocr_str(self, title: str, inline_words: bool = True) -> str:
    # inline_words=True keeps the current output (words nested inside lines);
    # inline_words=False selects the template for the previous format.
    environment = Environment(
        loader=PackageLoader("google.cloud.documentai_toolbox", "templates")
    )
    template_name = "hocr_document_template_inline_words.xml.j2"
    if not inline_words:
        template_name = "hocr_document_template.xml.j2"
    template = environment.get_template(template_name)
    return template.render(pages=self.pages, title=title)

Not the best code, since the j2 template itself can probably be parameterized to change its output, but it is the most straightforward approach I can think of.
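For reference, here is a minimal sketch of what parameterizing a single Jinja2 template could look like, instead of keeping two template files. The template text, the `render_hocr` helper, and the shape of the `lines` data are all hypothetical and heavily simplified. They are not the toolbox's real templates, and the non-inline branch is only a guess at the idea of the previous format, not its exact markup.

```python
from jinja2 import DictLoader, Environment

# Hypothetical one-line template: a single {% if %} switches between
# nesting ocrx_word spans inside each ocr_line and emitting plain line text.
TEMPLATE = (
    "{% for line in lines %}"
    "<span class='ocr_line'>"
    "{% if inline_words %}"
    "{% for word in line.words %}<span class='ocrx_word'>{{ word }}</span>{% endfor %}"
    "{% else %}{{ line.text }}{% endif %}"
    "</span>"
    "{% endfor %}"
)


def render_hocr(lines, inline_words=True):
    """Render the simplified template; inline_words is passed through as a flag."""
    # DictLoader stands in for PackageLoader so the sketch is self-contained.
    env = Environment(loader=DictLoader({"hocr.j2": TEMPLATE}))
    return env.get_template("hocr.j2").render(lines=lines, inline_words=inline_words)


lines = [{"text": "Hello world", "words": ["Hello", "world"]}]
print(render_hocr(lines))                      # words nested in the line
print(render_hocr(lines, inline_words=False))  # line text only
```

With this approach the method would only need to pass `inline_words` to `template.render(...)`, at the cost of a more complex template.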

Thanks for your attention.

smanero, Nov 03 '23 15:11