amazon-textract-response-parser

Highlighted text gets appended with the word SELECTED in output

Open sawasume opened this issue 2 years ago • 3 comments

The Textract response library appends the text SELECTED when something is highlighted in the text. An example is shown below. The original doc:

doc_og

This is what the output looks like:

selected_Capture

Code to generate the above output:

code-print

sawasume, Oct 31 '23 13:10

This looks like checkbox/mark identification from the checkbox model, which is part of TABLES and FORMS. Those are printed out as part of the rendering when available. Unfortunately there's no parameter right now to turn them off. A workaround could be to filter the SELECTION_ELEMENT blocks out of the JSON before sending it to trp, until we make a param available.
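A rough sketch of that workaround in Python, in case it helps. It assumes the raw Textract response is a dict with a top-level "Blocks" list, and that each block carries "Id", "BlockType" and optionally "Relationships". The helper name drop_selection_elements is made up for this example and is not part of trp. It also strips the removed IDs out of the remaining relationships so no dangling references are left behind. It hasn't been verified against every trp version, so treat it as a starting point:

```python
import copy


def drop_selection_elements(response: dict) -> dict:
    """Return a copy of a Textract response without SELECTION_ELEMENT blocks.

    Hypothetical helper for illustration only; not part of trp.
    """
    cleaned = copy.deepcopy(response)  # leave the caller's data untouched
    blocks = cleaned.get("Blocks", [])
    removed_ids = {
        b["Id"] for b in blocks if b.get("BlockType") == "SELECTION_ELEMENT"
    }

    kept = []
    for block in blocks:
        if block["Id"] in removed_ids:
            continue
        if "Relationships" in block:
            relationships = []
            for rel in block.get("Relationships") or []:
                # Drop references to the blocks we removed
                ids = [i for i in rel.get("Ids", []) if i not in removed_ids]
                if ids:
                    relationships.append({**rel, "Ids": ids})
            block["Relationships"] = relationships
        kept.append(block)

    cleaned["Blocks"] = kept
    return cleaned
```

The cleaned dict can then be passed to trp in place of the original response. Note that this also removes the checkbox state from FORMS and TABLES results, so it only makes sense if you don't need the selection status anywhere.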

schadem, Oct 31 '23 15:10

@athewsey I'm running into a similar but opposite issue when it comes to selections:

Image

SELECTED/NOT_SELECTED gets detected properly under Forms, but doesn't show up properly (shows up as an empty figure div) when rendered to HTML:

<p>
        What type of financing is the Borrower seeking? *
</p>
<p>
        Life
</p>
<p>
        CMBS
</p>
<div class="figure"></div>
<p>
        Agency
</p>
<div class="figure"></div>
<p>
        Bridge
</p>
<div class="figure"></div>
<p>
        Bank
</p>
<div class="figure"></div>
<p>
        Credit Union
</p>
<div class="figure"></div>
<p>
        Non-Bank
</p>

AChangXD, Feb 07 '25 06:02

Hi @sawasume, sorry just to check - are you using TRP in JavaScript? Or Python?

If Python then you can ignore the rest of this message, but if JS then there's some fact-finding that'd be useful:


In this case, from the generated HTML it looks like the items aren't getting detected as checkboxes at all, as I'd expect to see an <input> tag if they were...

  1. Did you run the document analysis with the FORMS feature enabled in that case?
  2. (With forms enabled) Can you find the relevant checkboxes under page.form in the result?
  3. If the K-V detections aren't being merged properly into those HTML entries there - do you find them duplicated anywhere else in the HTML output?
  4. Any chance you could locate the block for one of them (say, CMBS) in the raw data, and check that it's linked as a "relationship" from both A) a KEY_VALUE_SET block and B) a LINE which is in turn referenced by some kind of LAYOUT_* block?
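For point 4, a quick way to check the raw data is a small script like the one below. It's only a sketch: it assumes the raw response is a dict with a "Blocks" list, that the blocks of interest are the SELECTION_ELEMENT ones, and that any block type starting with LAYOUT_ counts as a layout block. The function names are made up for illustration. It works on the plain JSON, so it applies regardless of whether you use TRP from JS or Python:

```python
from collections import defaultdict


def find_parents(blocks: list) -> dict:
    """Map each block Id to the blocks that reference it in any relationship."""
    parents = defaultdict(list)
    for block in blocks:
        for rel in block.get("Relationships") or []:
            for child_id in rel.get("Ids", []):
                parents[child_id].append(block)
    return parents


def describe_selection_links(response: dict) -> list:
    """For every SELECTION_ELEMENT, list which block types point at it."""
    blocks = response.get("Blocks", [])
    parents = find_parents(blocks)
    report = []
    for block in blocks:
        if block.get("BlockType") != "SELECTION_ELEMENT":
            continue
        direct = parents.get(block["Id"], [])
        # LAYOUT_* blocks that reference a LINE which references this element
        layout_types = sorted({
            grandparent["BlockType"]
            for parent in direct
            if parent["BlockType"] == "LINE"
            for grandparent in parents.get(parent["Id"], [])
            if grandparent["BlockType"].startswith("LAYOUT_")
        })
        report.append({
            "Id": block["Id"],
            "SelectionStatus": block.get("SelectionStatus"),
            "ParentTypes": sorted({p["BlockType"] for p in direct}),
            "LayoutTypes": layout_types,
        })
    return report
```

If an entry is missing KEY_VALUE_SET or LINE in ParentTypes, or has an empty LayoutTypes, that would point at where the linkage breaks down.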

athewsey, Feb 18 '25 10:02