unstructured
feat(chunk): split tables on even row boundaries
Summary
Use a more sophisticated algorithm to split oversized Table elements into TableChunk elements during chunking, so that each chunk's text and HTML stay "synchronized" and the HTML is always parseable.
Additional Context
Table splitting now has the following characteristics:
- TableChunk.metadata.text_as_html is always a parseable HTML <table> subtree.
- TableChunk.text is always the text in the HTML version of the table fragment in .metadata.text_as_html. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger than the chunking window.
- .text_as_html is "minified", removing all extraneous whitespace and unneeded elements or attributes. This maximizes the semantic "density" of each chunk.
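The characteristics above can be illustrated with a small standalone sketch. This is not the code in this PR: split_table, html_text and the helpers are hypothetical names, the table is modeled as a plain list of rows of cell strings, and the chunking window is assumed to be measured in characters of the chunk text. It only shows the fallback order (whole rows, then cells, then words) and how text and minified HTML can be produced from the same fragment so they stay in sync.

```python
from html import escape
from html.parser import HTMLParser


def _text(rows):
    # Chunk text: every non-empty cell, separated by a single space.
    return " ".join(cell for row in rows for cell in row if cell)


def _render(rows):
    # Minified HTML: only <table>, <tr>, <td>; no whitespace or attributes.
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return _text(rows), f"<table>{body}</table>"


def _split_cell(cell, max_chars):
    # Break an oversized cell at word boundaries. A single word longer
    # than the window is kept whole (a limitation of this sketch).
    current = ""
    for word in cell.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chars:
            yield current
            current = word
        else:
            current = candidate
    if current:
        yield current


def _split_row(row, max_chars):
    # Yield the row whole if it fits, else fragments cut at cell boundaries.
    if len(_text([row])) <= max_chars:
        yield row
        return
    fragment = []
    for cell in row:
        fits = len(cell) <= max_chars
        for piece in ([cell] if fits else _split_cell(cell, max_chars)):
            if fragment and len(_text([fragment + [piece]])) > max_chars:
                yield fragment
                fragment = []
            fragment.append(piece)
    if fragment:
        yield fragment


def split_table(rows, max_chars):
    # Yield (text, html) pairs, preferring whole-row boundaries.
    pending = []
    for row in rows:
        for piece in _split_row(row, max_chars):
            if pending and len(_text(pending + [piece])) > max_chars:
                yield _render(pending)
                pending = []
            pending.append(piece)
    if pending:
        yield _render(pending)


class _CellText(HTMLParser):
    # Collects the text content of the parsed HTML fragment.
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_text(html):
    # Recover the text from a chunk's HTML to check synchronization.
    parser = _CellText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


rows = [
    ["Name", "Role"],
    ["Ada Lovelace", "Mathematician and writer"],
    ["Bob", "Dev"],
]
for text, html in split_table(rows, max_chars=20):
    assert html_text(html) == text  # text and HTML are "synchronized"
    print(repr(text), html)
```

With a 20-character window, the header row fits in its own chunk, the second row is too large and is cut between its two cells, and the second cell is in turn cut between words. The leftover word shares the last chunk with the final row, so every chunk is still a complete <table> fragment.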