
Problems with library importing PDF file with MongoDB as library repository: header_text is appended to text, header_text does not appear to be cleared properly

Open JEHollierJr opened this issue 1 year ago • 2 comments

I am seeing a problem with one of my PDF documents added to a library. In the MongoDB entries, the header_text field appears to be appended to the text field, so the text field contains a concatenation of text + header_text. In addition, header_text does not appear to be cleared properly: the header repeats within header_text, and that repeated content is then appended to the text field. In successive chunks in MongoDB, each new chunk of text has header_text appended, with the header repeating in header_text and growing.

In my other PDF documents the header_text field is empty, so I do not see the problem there.
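A minimal sketch of how the symptom described above could be checked, assuming the library blocks have already been fetched from MongoDB as plain dictionaries with `text` and `header_text` keys (the field names from this report). The helper names `find_header_leaks` and `header_lengths` are hypothetical and are not part of llmware.

```python
def find_header_leaks(blocks):
    """Return indices of blocks whose text ends with their non-empty header_text."""
    leaks = []
    for i, block in enumerate(blocks):
        header = (block.get("header_text") or "").strip()
        text = (block.get("text") or "").strip()
        # A leak is flagged only when there is a header and the text ends with it.
        if header and text.endswith(header):
            leaks.append(i)
    return leaks


def header_lengths(blocks):
    """Return the length of header_text per block, to spot a header that keeps growing."""
    return [len(block.get("header_text") or "") for block in blocks]


# Hypothetical sample data that mimics the reported pattern.
sample_blocks = [
    {"text": "Intro paragraph. Title", "header_text": "Title"},
    {"text": "Second paragraph. Title Title", "header_text": "Title Title"},
    {"text": "Clean paragraph.", "header_text": ""},
]

leaking = find_header_leaks(sample_blocks)
lengths = header_lengths(sample_blocks)
```

If the flagged indices cover most blocks of one document and the header lengths increase from chunk to chunk, that would match the behavior reported here.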

If I use Parser.parse_one() to parse the PDF document, I don't see header_text appended to the text field in the resulting parsed document.

I found this by comparing the results from parse_one(), where I concatenated the text fields into a text document, with the results from the library, where I did the same. The document built from the library was about 5 times as large as the one built from Parser.parse_one(), and it had blocks of header_text throughout.
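The comparison described above could be sketched roughly as follows. This assumes both result sets are available as lists of dictionaries with a `text` key; how they are obtained (from Parser.parse_one() or from the MongoDB collection) is left out, and the helper names `concat_text` and `size_ratio` are hypothetical.

```python
def concat_text(blocks):
    """Join the text field of every block into one document string."""
    return "\n".join((block.get("text") or "") for block in blocks)


def size_ratio(library_blocks, parsed_blocks):
    """Return how many times larger the library text is than the parse_one() text."""
    parsed_len = len(concat_text(parsed_blocks))
    if parsed_len == 0:
        raise ValueError("parsed_blocks contains no text to compare against")
    return len(concat_text(library_blocks)) / parsed_len


# Hypothetical sample data: the library copy carries repeated header text.
parsed_sample = [{"text": "abcd"}, {"text": "efgh"}]
library_sample = [{"text": "abcd HEADER"}, {"text": "efgh HEADER HEADER"}]

ratio = size_ratio(library_sample, parsed_sample)
```

A ratio well above 1 for the same source document suggests that extra content, such as repeated header text, is ending up in the stored text field.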

Let me know if more clarification is needed.

John

JEHollierJr avatar Jan 19 '24 12:01 JEHollierJr

I see this in v1.14 and v1.15.

JEHollierJr avatar Jan 19 '24 12:01 JEHollierJr

And I am working on Windows 11 Pro.

JEHollierJr avatar Jan 19 '24 12:01 JEHollierJr

@JEHollierJr - sorry for the delay in closing this. This issue has been fixed in v0.2.7, which is merged into the main branch if you clone the repo, and is in the latest pip install too. We made several fixes and improvements to "header_text", which consists of bold, italic and large-font (18 pt+) text, including the ability to turn off capturing header_text in the parse. Thanks again for raising this issue, and please share any feedback on the new fix. All the best, Darren

doberst avatar Apr 04 '24 19:04 doberst