python-documentai-toolbox

`split_pdf` splits too much, since it does not take into account that different entities might have the same type (but different confidence)

Open evekhm opened this issue 1 year ago • 1 comments

Here is an example of the entities returned from the splitter:

[text_anchor {
  text_segments {
    end_index: 1424
  }
}
type_: "form1"
confidence: 0.96
page_anchor {
  page_refs {
  }
  page_refs {
    page: 1
  }
  page_refs {
    page: 2
  }
}
, text_anchor {
  text_segments {
    start_index: 1424
    end_index: 6935
  }
}
type_: "form1"
confidence: 0.68
page_anchor {
  page_refs {
    page: 3
  }
  page_refs {
    page: 4
  }
}
]

In this case, all pages are actually of the same type, so we should not split. However, `document.Document.split_pdf` does not detect that.

evekhm, Jul 11 '24 22:07

Ok, this is a bit complicated because the Document AI Custom Splitter specifically detected those two "form1" entries as separate documents.

If we combine them together by default, it could create ambiguity when there are multiple separate documents of the same type in a file.

We could create a parameter like combine_like_document_types or something like that, but I think this issue would be best resolved on the Custom Splitter itself.
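
For illustration only, here is a rough sketch of what such an opt-in merge could look like. Nothing here is part of the toolbox: `SplitEntity` is a hypothetical stand-in for the splitter entities (just a type and 0-based page numbers), and `combine_like_document_types` is the hypothetical name floated above, written as a standalone helper. It only merges consecutive entities of the same type whose pages continue without a gap, and it ignores confidence entirely, so the ambiguity about genuinely separate documents of the same type still applies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SplitEntity:
    """Hypothetical stand-in for a splitter entity: a type and its 0-based pages."""
    type_: str
    pages: List[int]


def combine_like_document_types(entities: List[SplitEntity]) -> List[SplitEntity]:
    """Merge consecutive entities that share a type and have contiguous pages."""
    merged: List[SplitEntity] = []
    for entity in entities:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and previous.type_ == entity.type_
            and previous.pages
            and entity.pages
            and entity.pages[0] == previous.pages[-1] + 1
        ):
            # Same type and the page range continues: treat as one document.
            previous.pages.extend(entity.pages)
        else:
            # Copy the entity so the caller's list is not mutated.
            merged.append(SplitEntity(entity.type_, list(entity.pages)))
    return merged


# The two "form1" entities from the report: pages 0-2 and pages 3-4.
entities = [
    SplitEntity("form1", [0, 1, 2]),
    SplitEntity("form1", [3, 4]),
]
combined = combine_like_document_types(entities)
```

With the entities from the report, this yields a single "form1" entry covering pages 0 through 4, so a split based on `combined` would keep the file in one piece. Two back-to-back "form1" documents that really are separate would be merged as well, which is why this would have to be opt-in.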

holtskinner, Jul 15 '24 15:07