amazon-textract-response-parser

JS - Clarification on Rendering Signatures in HTML Output and Markdown Support

Open serhii12 opened this issue 10 months ago • 1 comment

Hi,

I'm currently using the following code to generate HTML content while skipping certain block types:

const htmlContent = page.html({
    skipBlockTypes: ['LAYOUT_FOOTER' as ApiBlockType, 'LAYOUT_FIGURE' as ApiBlockType],
});

However, I noticed that signatures are not included in the HTML output. To handle this, I found that I need to manually search for signature blocks using:

const signatureBlocks = page.listSignatures().map(signature => {
    return signature.str(); // Placeholder like [SIGNATURE]
});
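A minimal sketch of how those placeholders might be stitched back into the HTML string as a stopgap. Both helper names (`escapeHtmlText`, `appendSignaturePlaceholders`) and the `signature` CSS class are hypothetical and not part of amazon-textract-response-parser; the sketch only assumes that `page.html()` returns a string and that `signature.str()` returns placeholder text.

```typescript
// Hypothetical helpers for illustration only: not part of the TRP library.

/** Escape the characters that would otherwise be parsed as HTML markup. */
function escapeHtmlText(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

/**
 * Append one <p class="signature"> element per placeholder after the HTML.
 * The signatures land at the end of the output, not in reading order,
 * because nothing here knows where they sit in the layout flow.
 */
function appendSignaturePlaceholders(html: string, placeholders: string[]): string {
  if (placeholders.length === 0) return html;
  const items = placeholders
    .map((p) => `<p class="signature">${escapeHtmlText(p)}</p>`)
    .join("\n");
  return `${html}\n${items}`;
}
```

With the snippets above this would be called as `appendSignaturePlaceholders(htmlContent, signatureBlocks)`. It does not solve the placement problem; it only keeps the placeholders from being lost.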

Alternatively, I was exploring the idea of generating a full page in Markdown for more flexibility. I was also considering using https://github.com/mixmark-io/turndown to convert the HTML to Markdown.

For instance, using getLinesByLayoutArea to segment the page, then converting each page into HTML and afterwards into Markdown:

const segmented = page.getLinesByLayoutArea(true); // Sort lines in reading order

console.log("---- HEADER:");
console.log(segmented.header.map((l) => l.text).join("\n"));
console.log("\n---- CONTENT:");
console.log(segmented.content.map((l) => l.text).join("\n"));
console.log("\n---- FOOTER:");
console.log(segmented.footer.map((l) => l.text).join("\n"));
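As a rough illustration of going straight from that segmented result to Markdown without the HTML step, the sketch below assumes only the shape used in the snippet above: `header`, `content` and `footer` arrays whose items expose a `text` string. The `LineLike` and `SegmentedLines` types and the `segmentsToMarkdown` function are hypothetical names, not library API.

```typescript
// Hypothetical structural types mirroring the fields used in the snippet above.
interface LineLike {
  text: string;
}

interface SegmentedLines {
  header: LineLike[];
  content: LineLike[];
  footer: LineLike[];
}

/**
 * Render each non-empty layout area as a level-2 Markdown section.
 * Line text is emitted as-is (no Markdown escaping), and consecutive
 * lines are separated by single newlines, so a Markdown renderer will
 * flow them into one paragraph per section.
 */
function segmentsToMarkdown(segmented: SegmentedLines): string {
  const sections: Array<[string, LineLike[]]> = [
    ["Header", segmented.header],
    ["Content", segmented.content],
    ["Footer", segmented.footer],
  ];
  return sections
    .filter(([, lines]) => lines.length > 0)
    .map(([title, lines]) => `## ${title}\n\n${lines.map((l) => l.text).join("\n")}`)
    .join("\n\n");
}
```

This loses tables, key-value pairs and signatures, since it only looks at line text, so it is closer to a plain-text dump with headings than to a real Markdown renderer.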

While this works, rendering the signature within the HTML feels a bit cumbersome. Is there a more streamlined way to include signature blocks in the HTML? Additionally, I was wondering if there are any plans to introduce a Markdown rendering option for JavaScript in the near future. It could be helpful for those of us working with simpler, more flexible outputs.

serhii12 • Jan 14 '25 16:01

Thanks for raising @serhii12

Firstly on the subject of .markdown(): We're definitely still interested, and the bottom-up implementation pattern starting from IRenderable should work pretty much the same for implementing it as .html() does everywhere else... But as discussed in #173 (almost a year ago now but still pretty much the same situation 😭), the complexity of tables in particular has been a bit of a blocker. If you have feedback on how the challenges talked about there would be best solved for you, it'd be great if you could share it in that other issue! For example: How valuable would a Markdown implementation that still mostly [or even always?] rendered Tables in HTML be? If we try to render MD tables, should we just accept they'll be unreadable in general so optimize for smallest token count? etc. Maybe we could try to start with a very basic version that handles tables badly, and try to improve it incrementally over time?
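To make the "mostly render tables in HTML" option concrete, here is a minimal sketch of one possible policy: emit a Markdown pipe table only when the table is a plain rectangular grid, and fall back to HTML for anything with merged cells or multi-line text. Everything here (`CellLike`, `isSimpleGrid`, `renderTableForMarkdown`) is hypothetical and says nothing about how TRP models tables internally or how a future `.markdown()` would behave.

```typescript
// Hypothetical cell shape for illustration; not the TRP table model.
interface CellLike {
  text: string;
  rowSpan?: number;
  columnSpan?: number;
}

function escapeCellHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

/** True when every row has the same width and no cell is merged or multi-line. */
function isSimpleGrid(rows: CellLike[][]): boolean {
  const first = rows[0];
  if (first === undefined || first.length === 0) return false;
  return rows.every(
    (row) =>
      row.length === first.length &&
      row.every(
        (c) => (c.rowSpan ?? 1) === 1 && (c.columnSpan ?? 1) === 1 && !c.text.includes("\n"),
      ),
  );
}

/** Pipe table for simple grids (first row treated as header), HTML otherwise. */
function renderTableForMarkdown(rows: CellLike[][]): string {
  const [first, ...rest] = rows;
  if (first !== undefined && isSimpleGrid(rows)) {
    // Only pipes are escaped here; other Markdown syntax in cells is left alone.
    const toRow = (row: CellLike[]): string =>
      `| ${row.map((c) => c.text.replace(/\|/g, "\\|")).join(" | ")} |`;
    const separator = `| ${first.map(() => "---").join(" | ")} |`;
    return [toRow(first), separator, ...rest.map(toRow)].join("\n");
  }
  const body = rows
    .map((row) => {
      const cells = row
        .map((c) => {
          const rowAttr = (c.rowSpan ?? 1) > 1 ? ` rowspan="${c.rowSpan}"` : "";
          const colAttr = (c.columnSpan ?? 1) > 1 ? ` colspan="${c.columnSpan}"` : "";
          return `<td${rowAttr}${colAttr}>${escapeCellHtml(c.text)}</td>`;
        })
        .join("");
      return `<tr>${cells}</tr>`;
    })
    .join("");
  return `<table>${body}</table>`;
}
```

The trade-off is that output format varies per table, which consumers would need to tolerate; the upside is that simple tables stay compact while complex ones stay structurally correct.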

On signatures:

This is an interesting call-out 🤔 From the Textract docs I see there's no dedicated LAYOUT_SIGNATURE (or similar) Block in Layout responses. From the few docs in our unit tests, it looks like the only parents referencing SIGNATURE blocks are 1/ the PAGE and 2/ KEY_VALUE_SETs (i.e. the signatures are in the value of a K-V pair).

...So I expect the signatures aren't pulling through because nothing in the Layout response is referencing them, and at the moment it's Layout.html() that drives the serialization.

  1. Have you / could you try enabling the FORMS feature in Textract? If your signatures (like the ones I see in test) are detected as K-V values, then I think this should pull them through to the HTML output? [Yes I understand this carries extra cost so may not be ideal]
  2. In the raw API response for (a couple of) your documents, could you check what other block types have references to your SIGNATUREs' block IDs? Are you also just seeing PAGE, KEY_VALUE_SET and nothing else? Or do you see any other blocks referencing the signatures as well?

If nothing in the layout flow references the SIGNATUREs, it'll be difficult for TRP to place them correctly in the page's flow.
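The check in point 2 can be sketched directly against the raw response JSON, without TRP. This assumes the standard Textract shape where each entry in `Blocks` has an `Id`, a `BlockType`, and optional `Relationships` whose `Ids` list the referenced blocks; the `RawBlock` / `RawRelationship` types and the `findSignatureReferrers` function are illustrative names only.

```typescript
// Minimal structural types for the raw Textract JSON fields used below.
interface RawRelationship {
  Type: string;
  Ids?: string[];
}

interface RawBlock {
  Id: string;
  BlockType: string;
  Relationships?: RawRelationship[] | null;
}

/**
 * For every SIGNATURE block, list the distinct BlockTypes of the blocks
 * that reference it through any relationship. A signature that nothing
 * references maps to an empty array.
 */
function findSignatureReferrers(blocks: RawBlock[]): Map<string, string[]> {
  const result = new Map<string, string[]>();
  for (const block of blocks) {
    if (block.BlockType === "SIGNATURE") result.set(block.Id, []);
  }
  for (const block of blocks) {
    for (const rel of block.Relationships ?? []) {
      for (const id of rel.Ids ?? []) {
        const types = result.get(id);
        if (types !== undefined && !types.includes(block.BlockType)) {
          types.push(block.BlockType);
        }
      }
    }
  }
  return result;
}
```

Calling it on the `Blocks` array of a response and logging the result would show whether anything other than PAGE and KEY_VALUE_SET points at the signatures, for example any of the LAYOUT_* block types.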

athewsey • Jan 20 '25 15:01