Support different formats in method export_to_hocr
Hi,
I have found a backward incompatibility caused by the change included in the 2a0 version. When using the export_hocr_str() method in wrapper.document.py, the hOCR output format differs from previous versions.
That change alters how the "ocr_line" class is emitted in the hOCR output: each line now contains the "ocrx_word" elements that compose it. This makes the output more logical and closer to the spec, but all processors trained on the previous hOCR format now malfunction.
I would like to propose a solution where the method can also export the hOCR in the previous format, keeping the current format as the default. Something like this:
    from jinja2 import Environment, PackageLoader

    def export_hocr_str(self, title: str, inline_words: bool = True) -> str:
        environment = Environment(
            loader=PackageLoader("google.cloud.documentai_toolbox", "templates")
        )
        # Default: current format, with the ocrx_word elements inside each ocr_line.
        template_name = "hocr_document_template_inline_words.xml.j2"
        if not inline_words:
            # Previous format, kept for processors trained on it.
            template_name = "hocr_document_template.xml.j2"
        template = environment.get_template(template_name)
        content = template.render(pages=self.pages, title=title)
        return content
Not the best code, since I suspect the j2 template itself could be parameterized to change its output, but it is the most straightforward approach I can think of.
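To illustrate the template-side alternative, here is a minimal sketch of a single Jinja2 template that switches its output on a flag passed to render(). Everything in it is an assumption made for illustration: the template text is heavily simplified and is not the toolbox's real template, the `render_hocr` helper and the `hocr.xml.j2` name are hypothetical, and the "previous" layout shown (words emitted as siblings after the line) is only a guess at the old format.

```python
from jinja2 import DictLoader, Environment

# Hypothetical, simplified template. The inline_words flag selects the layout:
# True nests the words inside the line, False emits them after the line.
TEMPLATE = (
    "{% for line in lines %}"
    "{% if inline_words %}"
    "<span class='ocr_line'>"
    "{% for word in line.words %}<span class='ocrx_word'>{{ word }}</span>{% endfor %}"
    "</span>"
    "{% else %}"
    "<span class='ocr_line'>{{ line.words | join(' ') }}</span>"
    "{% for word in line.words %}<span class='ocrx_word'>{{ word }}</span>{% endfor %}"
    "{% endif %}"
    "{% endfor %}"
)

# DictLoader keeps the sketch self-contained; the real code uses PackageLoader.
env = Environment(loader=DictLoader({"hocr.xml.j2": TEMPLATE}))


def render_hocr(lines, inline_words=True):
    """Render the sketch template, forwarding the flag to the template."""
    template = env.get_template("hocr.xml.j2")
    return template.render(lines=lines, inline_words=inline_words)


if __name__ == "__main__":
    sample = [{"words": ["Hello", "world"]}]
    print(render_hocr(sample))                      # words nested in the line
    print(render_hocr(sample, inline_words=False))  # words after the line
```

With this approach there would be only one template file to maintain, at the cost of more branching inside the template itself.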
Thanks for your attention.