amazon-textract-response-parser
JS - Clarification on Rendering Signatures in HTML Output and Markdown Support
Hi,
I'm currently using the following code to generate HTML content while skipping certain block types:
const htmlContent = page.html({
  skipBlockTypes: ['LAYOUT_FOOTER' as ApiBlockType, 'LAYOUT_FIGURE' as ApiBlockType],
});
However, I noticed that signatures are not included in the HTML output. To handle this, I found that I need to manually search for signature blocks using:
const signatureBlocks = page.listSignatures().map(signature => {
  return signature.str(); // Placeholder like [SIGNATURE]
});
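A rough sketch of that manual workaround: it appends the signature placeholders to the end of the HTML string. `appendSignatures` and `escapeHtml` are hypothetical helpers, not part of amazon-textract-response-parser, and the example string stands in for the output of `page.html()`, which is assumed here to be an HTML fragment rather than a full document. Appending at the end also loses the signatures' position in the reading order.

```typescript
/** Escape the characters that are significant in HTML text content. */
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

/**
 * Append signature placeholders to an HTML fragment.
 * Hypothetical helper (assumption), not a library API.
 */
function appendSignatures(html: string, signatures: string[]): string {
  if (signatures.length === 0) return html; // nothing to add
  const items = signatures
    .map((s) => "  <p>" + escapeHtml(s) + "</p>")
    .join("\n");
  return html + '\n<div class="signatures">\n' + items + "\n</div>";
}

// Stand-ins (assumptions) for page.html(...) and
// page.listSignatures().map((s) => s.str()).
const exampleHtml = '<div class="page">\n<p>Signed by:</p>\n</div>';
console.log(appendSignatures(exampleHtml, ["[SIGNATURE]"]));
```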
Alternatively, I was exploring the idea of generating a full page in Markdown for more flexibility. I was also considering using https://github.com/mixmark-io/turndown to convert HTML to Markdown.
For instance, I could use getLinesByLayoutArea to segment the page, then convert each page to HTML and from there to Markdown:
const segmented = page.getLinesByLayoutArea(true); // Sort lines in reading order
console.log("---- HEADER:");
console.log(segmented.header.map((l) => l.text).join("\n"));
console.log("\n---- CONTENT:");
console.log(segmented.content.map((l) => l.text).join("\n"));
console.log("\n---- FOOTER:");
console.log(segmented.footer.map((l) => l.text).join("\n"));
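As a minimal sketch of going straight from these segments to Markdown (skipping the HTML step), something like the following could work. `segmentsToMarkdown`, `LineLike`, and `SegmentedLines` are hypothetical names; the plain objects only mirror the `header` / `content` / `footer` arrays and the `text` field used above. The section headings are an arbitrary choice, and the line text is not escaped for Markdown special characters.

```typescript
// Only the `text` field of each line is used here (assumed shape).
interface LineLike {
  text: string;
}
interface SegmentedLines {
  header: LineLike[];
  content: LineLike[];
  footer: LineLike[];
}

/** Render the three layout areas as Markdown sections, skipping empty ones. */
function segmentsToMarkdown(segmented: SegmentedLines): string {
  const parts: string[] = [];
  const addSection = (title: string, lines: LineLike[]): void => {
    if (lines.length === 0) return; // skip empty areas
    // Two trailing spaces before the newline force a hard line break.
    const body = lines.map((l) => l.text).join("  \n");
    parts.push("## " + title + "\n\n" + body);
  };
  addSection("Header", segmented.header);
  addSection("Content", segmented.content);
  addSection("Footer", segmented.footer);
  return parts.join("\n\n");
}

// Made-up example data standing in for page.getLinesByLayoutArea(true).
const exampleSegments: SegmentedLines = {
  header: [{ text: "ACME Corp" }],
  content: [{ text: "Dear customer," }, { text: "Thank you for your order." }],
  footer: [],
};
console.log(segmentsToMarkdown(exampleSegments));
```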
While this works, rendering the signature within the HTML feels a bit cumbersome. Is there a more streamlined way to include signature blocks in the HTML? Additionally, I was wondering if there are any plans to introduce a Markdown rendering option for JavaScript in the near future. It could be helpful for those of us working with simpler, more flexible outputs.
Thanks for raising this, @serhii12!
Firstly on the subject of .markdown(): We're definitely still interested, and the bottom-up implementation pattern starting from IRenderable should work pretty much the same for implementing it as .html() does everywhere else... But as discussed in #173 (almost a year ago now but still pretty much the same situation 😭), the complexity of tables in particular has been a bit of a blocker. If you have feedback on how the challenges talked about there would be best solved for you, it'd be great if you could share it in that other issue! For example: How valuable would a Markdown implementation that still mostly [or even always?] rendered Tables in HTML be? If we try to render MD tables, should we just accept they'll be unreadable in general so optimize for smallest token count? etc. Maybe we could try to start with a very basic version that handles tables badly, and try to improve it incrementally over time?
On signatures:
This is an interesting call-out 🤔 From the Textract docs I see there's no dedicated LAYOUT_SIGNATURE (or similar) Block in Layout responses. From the few docs in our unit tests, it looks like the only parents referencing SIGNATURE blocks are 1/ the PAGE and 2/ KEY_VALUE_SETs (i.e. the signatures are in the value of a K-V pair).
...So I expect the signatures aren't pulling through because nothing in the Layout response is referencing them, and at the moment it's Layout.html() that drives the serialization.
- Have you / could you try enabling the FORMS feature in Textract? If your signatures (like the ones I see in test) are detected as K-V values, then I think this should pull them through to the HTML output? [Yes I understand this carries extra cost so may not be ideal]
- In the raw API response for (a couple of) your documents, could you check what other block types have references to your SIGNATUREs' block IDs? Are you also just seeing PAGE, KEY_VALUE_SET and nothing else? Or you see any other blocks referencing the signatures as well?
If nothing in the layout flow references the SIGNATUREs, it'll be difficult for TRP to place them correctly in the page's flow.
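For the raw-response check suggested above, a small standalone script along these lines might help. It's only a sketch: `findSignatureParents` and the `TextractBlock` / `TextractRelationship` types are hypothetical names covering just the `Id`, `BlockType`, and `Relationships` fields of the Textract response, and the sample blocks are made up for illustration.

```typescript
// Minimal subset of the Textract Block shape needed for this check.
interface TextractRelationship {
  Type: string;
  Ids: string[];
}
interface TextractBlock {
  Id: string;
  BlockType: string;
  Relationships?: TextractRelationship[];
}

/**
 * For each SIGNATURE block Id, list the "BlockType (RelationshipType)" of
 * every block that references it.
 */
function findSignatureParents(blocks: TextractBlock[]): Record<string, string[]> {
  const parents: Record<string, string[]> = {};
  // First pass: register every SIGNATURE block.
  for (const block of blocks) {
    if (block.BlockType === "SIGNATURE") parents[block.Id] = [];
  }
  // Second pass: record which blocks point at those Ids.
  for (const block of blocks) {
    for (const rel of block.Relationships || []) {
      for (const id of rel.Ids) {
        if (Object.prototype.hasOwnProperty.call(parents, id)) {
          parents[id].push(block.BlockType + " (" + rel.Type + ")");
        }
      }
    }
  }
  return parents;
}

// Made-up sample; in practice pass the `Blocks` array of your API response.
const sampleBlocks: TextractBlock[] = [
  { Id: "page-1", BlockType: "PAGE", Relationships: [{ Type: "CHILD", Ids: ["sig-1", "line-1"] }] },
  { Id: "value-1", BlockType: "KEY_VALUE_SET", Relationships: [{ Type: "CHILD", Ids: ["sig-1"] }] },
  { Id: "layout-1", BlockType: "LAYOUT_TEXT", Relationships: [{ Type: "CHILD", Ids: ["line-1"] }] },
  { Id: "line-1", BlockType: "LINE" },
  { Id: "sig-1", BlockType: "SIGNATURE" },
];
console.log(findSignatureParents(sampleBlocks));
```

If any LAYOUT_* type shows up in the output for your documents, that would be useful to know, since it would give the layout-driven serialization something to hook into.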