Library.add_files(params, max_chunk_size=n) often creates records in the db whose chunk size vastly exceeds n, often an entire document page of text
As described in the title. This appears to be mostly associated with parsing PDF documents in which entire pages consist of a scanned image. Are these records included in embedding? If so, that is problematic, right?
macOS 15.x, llmware v0.3.8, active_db: sqlite
@wissamharoun - thanks for this detailed feedback. Yes, I can confirm that there are scenarios in which the parser creates a text chunk larger than the requested max chunk size. I would encourage you to look at this example (if you have not already) - pdf_parser_configs ... The most common cases are an embedded scanned image or a table, where it is difficult to apply a hard cut-off at a specific character limit. Depending on your use case, you may have to build some custom limit handling or safeguards in the downstream processing. We can make enhancements to llmware based on the problems you are experiencing, and we are happy to work with you on it. Just let me know what specific challenges this is creating in your use case.
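As a rough illustration of the kind of downstream safeguard mentioned above, here is a minimal sketch in plain Python that re-splits any oversized chunk before it is embedded. It does not use any llmware API. The helper names `split_oversized_chunk` and `enforce_max_chunk_size` are hypothetical, and the assumption that each record is a dict holding its text under a `"text"` key should be checked against your own records.

```python
from typing import Dict, List


def split_oversized_chunk(text: str, max_chunk_size: int) -> List[str]:
    """Split text into pieces of at most max_chunk_size characters,
    breaking at a space where one is available."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    pieces = []
    remaining = text.strip()
    while len(remaining) > max_chunk_size:
        # Find the last space that keeps the piece within the limit.
        cut = remaining.rfind(" ", 0, max_chunk_size + 1)
        if cut <= 0:
            # No usable space (e.g. one long OCR token): hard cut.
            cut = max_chunk_size
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces


def enforce_max_chunk_size(
    records: List[Dict], max_chunk_size: int, text_key: str = "text"
) -> List[Dict]:
    """Return new records in which no text exceeds max_chunk_size.

    Assumption: each record is a dict with its text under text_key.
    All other fields are copied unchanged onto every resulting piece.
    """
    result = []
    for record in records:
        for piece in split_oversized_chunk(record.get(text_key, ""), max_chunk_size):
            new_record = dict(record)
            new_record[text_key] = piece
            result.append(new_record)
    return result
```

Records that are already within the limit pass through unchanged, so a filter like this could sit between parsing and embedding without affecting well-formed chunks. Records with empty text are dropped, which may or may not be what you want for image-only pages.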