python-documentai-toolbox
`split_pdf` splits too much, since it does not take into account that different entities might have the same type (but different confidence)
Here is an example of the entities returned by the splitter:
[text_anchor {
  text_segments {
    end_index: 1424
  }
}
type_: "form1"
confidence: 0.96
page_anchor {
  page_refs {
  }
  page_refs {
    page: 1
  }
  page_refs {
    page: 2
  }
}
, text_anchor {
  text_segments {
    start_index: 1424
    end_index: 6935
  }
}
type_: "form1"
confidence: 0.68
page_anchor {
  page_refs {
    page: 3
  }
  page_refs {
    page: 4
  }
}
]
In this case we see that all pages are actually of the same type, so we should not split. However, `document.Document.split_pdf` does not detect that.
OK, this is a bit complicated, because the Document AI Custom Splitter specifically detected those two "form1" entries as separate documents.
If we combined them by default, it could create ambiguity when a file contains multiple separate documents of the same type.
We could add a parameter like `combine_like_document_types` or something like that, but I think this issue would be best resolved in the Custom Splitter itself.
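
For illustration, here is a rough sketch of what such an opt-in merge step might look like. It is not the toolbox's implementation: `SplitEntity` is a hypothetical stand-in for the splitter entities shown above (type, confidence, and zero-based page numbers), and keeping the lowest confidence of the merged group is just one possible choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SplitEntity:
    """Hypothetical stand-in for a splitter entity (not a toolbox class)."""
    type_: str
    confidence: float
    pages: List[int]  # zero-based page numbers, in order


def combine_like_document_types(entities: List[SplitEntity]) -> List[SplitEntity]:
    """Merge consecutive entities that share a type and cover contiguous pages."""
    merged: List[SplitEntity] = []
    for entity in entities:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and previous.type_ == entity.type_
            and previous.pages
            and entity.pages
            and entity.pages[0] == previous.pages[-1] + 1
        ):
            # Assumption: report the lowest confidence of the merged group.
            merged[-1] = SplitEntity(
                type_=previous.type_,
                confidence=min(previous.confidence, entity.confidence),
                pages=previous.pages + entity.pages,
            )
        else:
            merged.append(entity)
    return merged


# The two "form1" entities from the example above (an empty page_ref is page 0).
entities = [
    SplitEntity("form1", 0.96, [0, 1, 2]),
    SplitEntity("form1", 0.68, [3, 4]),
]
combined = combine_like_document_types(entities)
```

With the example above, `combined` holds a single "form1" entity covering pages 0 through 4. The same merge would also glue together two genuinely separate "form1" documents that happen to sit next to each other, which is the ambiguity mentioned above and the reason this would have to be opt-in.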