
feat(chunk): split tables on even row boundaries

Open scanny opened this issue 1 year ago • 0 comments

Summary: Use a more sophisticated algorithm for splitting an oversized Table element into TableChunk elements during chunking, so that each chunk's text and HTML stay "synchronized" and the HTML is always parseable.

Additional Context: Table splitting now has the following characteristics:

  • TableChunk.metadata.text_as_html is always a parseable HTML <table> subtree.
  • TableChunk.text always matches the text of the table fragment held in .metadata.text_as_html, so text and HTML are "synchronized".
  • The table is divided at a whole-row boundary whenever possible.
  • A row is broken at an even-cell boundary when a single row is larger than the chunking window.
  • A cell is broken at an even-word boundary when a single cell is larger than the chunking window.
  • .text_as_html is "minified", removing all extraneous whitespace and unneeded elements or attributes. This maximizes the semantic "density" of each chunk.
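
The splitting rules above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: every name in it (`chunk_table`, `split_row`, `split_cell`, `html_text`) is hypothetical, a table is assumed to be a plain list of rows of cell strings, and the chunking window is assumed to be measured in characters of text only. It emits minimal `<table>` markup to begin with rather than minifying existing HTML, and a single word longer than the window is left whole.

```python
import xml.etree.ElementTree as ET
from html import escape
from typing import List, Tuple

Row = List[str]  # hypothetical model: one string per cell


def row_text(row: Row) -> str:
    """Text of one row: non-empty cells joined by a space."""
    return " ".join(cell for cell in row if cell)


def rows_text(rows: List[Row]) -> str:
    """Text of a table fragment: one line per row."""
    return "\n".join(row_text(row) for row in rows)


def rows_html(rows: List[Row]) -> str:
    """Minimal markup with no whitespace or attributes."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def split_cell(cell: str, max_chars: int) -> List[str]:
    """Break an oversized cell on word boundaries only."""
    pieces: List[str] = []
    current = ""
    for word in cell.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def split_row(row: Row, max_chars: int) -> List[Row]:
    """Break an oversized row on cell boundaries, then on words."""
    fragments: List[Row] = []
    current: Row = []
    for cell in row:
        for piece in split_cell(cell, max_chars) or [""]:
            if current and len(row_text(current + [piece])) > max_chars:
                fragments.append(current)
                current = []
            current.append(piece)
    if current:
        fragments.append(current)
    return fragments


def chunk_table(rows: List[Row], max_chars: int) -> List[Tuple[str, str]]:
    """Return (text, html) pairs, preferring whole-row boundaries."""
    chunks: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        fits = len(row_text(row)) <= max_chars
        for fragment in [row] if fits else split_row(row, max_chars):
            if current and len(rows_text(current + [fragment])) > max_chars:
                chunks.append(current)
                current = []
            current.append(fragment)
    if current:
        chunks.append(current)
    # Text and HTML come from the same rows, so they cannot drift apart.
    return [(rows_text(chunk), rows_html(chunk)) for chunk in chunks]


def html_text(html: str) -> str:
    """Re-derive the text from the markup to check synchronization."""
    root = ET.fromstring(html)
    return "\n".join(" ".join(td.text for td in tr if td.text) for tr in root)
```

Because both the text and the HTML of a chunk are rendered from the same list of rows, checking `html_text(html) == text` for every pair returned by `chunk_table` is a cheap way to confirm the "synchronized" property in this sketch.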

scanny, Aug 09 '24 17:08