Library.add_files(params, max_chunk_size=n) often creates records in the db whose chunk size vastly exceeds n - often representing an entire document page of text

Open - wissamharoun opened this issue 1 year ago - 1 comment

Simply as described in the title. This appears to be mostly associated with parsing PDF documents in which entire pages consist of a scanned image. Are these types of records included in embedding? If so, that is problematic, right?

Environment: macOS 15.x, llmware v 0.3.8, active_db: sqlite

wissamharoun, Nov 26 '24 20:11

@wissamharoun - thanks for this detailed feedback, and yes, I can confirm that there are scenarios in which the parser may create a text chunk larger than the requested maximum chunk size. I would encourage you to look at this example (if you have not already) - pdf_parser_configs ... The most common situations involve an embedded scanned image or a table, where it is difficult to apply a hard cut-off at a specific character limit. Depending on your use case, you may have to build some custom limit handling or safeguards in the downstream processing. Based on the problems you are experiencing, we can make enhancements to llmware - and we are happy to work with you on it - just let me know what specific challenges it is creating in your use case.

doberst, Dec 02 '24 17:12
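
For anyone who needs the kind of downstream safeguard mentioned in the reply, here is a minimal sketch of one possible approach. It is not part of llmware and does not call any llmware API. It assumes you have already pulled the text of each chunk out of the library as plain Python strings, and the helper names (`split_oversized`, `enforce_max_chunk_size`) are hypothetical. It re-splits any chunk longer than the limit, preferring to break at a space and falling back to a hard cut when no space is available.

```python
from typing import Iterable, List


def split_oversized(text: str, max_chunk_size: int) -> List[str]:
    """Split text into pieces of at most max_chunk_size characters.

    Hypothetical helper (not an llmware function). Breaks at the last
    space inside the allowed window when possible, otherwise hard-cuts.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")

    pieces: List[str] = []
    remaining = text.strip()

    while len(remaining) > max_chunk_size:
        # Last space at an index <= max_chunk_size, so the piece fits.
        cut = remaining.rfind(" ", 0, max_chunk_size + 1)
        if cut <= 0:
            # No usable space (e.g. one long token): hard cut at the limit.
            cut = max_chunk_size
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()

    if remaining:
        pieces.append(remaining)
    return pieces


def enforce_max_chunk_size(chunks: Iterable[str], max_chunk_size: int) -> List[str]:
    """Apply split_oversized to every chunk and flatten the result.

    Chunks already within the limit pass through (whitespace-trimmed);
    empty chunks are dropped.
    """
    result: List[str] = []
    for chunk in chunks:
        result.extend(split_oversized(chunk, max_chunk_size))
    return result
```

Running the chunk texts through `enforce_max_chunk_size` before embedding would keep every piece within the limit, at the cost of losing the one-to-one mapping between database records and embedded pieces, so you may want to track the original record id alongside each piece.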