unstract fix: [PDF document with 81 pages being indexed into 1 node in Qdrant and Postgres, missing 99% of the document after "successfully indexed"]

Describe the bug

While trying the community version, I successfully connected an Azure LLM, Qdrant, and LlamaParse, then uploaded a single document and clicked "index". It reports that the document was indexed successfully, but with only "1 node". On further investigation, the Qdrant vector DB contains a single indexed node holding only the text of the document's title page. No other part of the document is indexed.

To reproduce

Configure Azure LLM, LlamaParse, and Qdrant, upload a PDF with chunk_size = 1024 and overlap = 128, then press "index".

Expected behavior

I would expect to see thousands of nodes in my Qdrant vector db of the successfully parsed/split document.
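For a rough sense of how many nodes to expect, here is a minimal sketch that assumes a plain sliding-window splitter with the chunk_size and overlap above. The function name `expected_chunks` and the token count passed to it are hypothetical, and Unstract's actual splitter may count differently (for example by sentence boundaries), so treat the result as an order-of-magnitude estimate only.

```python
import math


def expected_chunks(total_tokens: int, chunk_size: int = 1024, overlap: int = 128) -> int:
    """Estimate how many chunks a sliding-window splitter would produce.

    Assumption: each new chunk starts (chunk_size - overlap) tokens after
    the previous one. This is an illustration, not Unstract's real logic.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # One full first chunk, plus one chunk per stride needed to cover the rest.
    return 1 + math.ceil((total_tokens - chunk_size) / stride)
```

Under these assumptions, any document longer than 1024 tokens yields at least two chunks, so a single node for an 81-page PDF suggests that most of the text never reached the splitter.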

Environment details

Version: v0.101.6

Screenshots

Full log:

Parsing nodes: 100% 1/1:

Qdrant collection with 1 point:

EDIT: Signed up for the unstract cloud free version, same issue there. It only indexes the first few characters of my document. I have checked that the llamaparse API works fine with my document.

Screenshot of the unstract cloud:

chunks used button:

Dec 30 '24 09:12 Seth-Peters

@Seth-Peters could you try once with the free version of LLMWhisperer to confirm whether this issue happens only with LlamaParse?

Dec 31 '24 06:12 ritwik-g

@ritwik-g - it works with LLMWhisperer. Not sure what is happening, as I did check that the document itself parses correctly in my LlamaParse playground (with my account/API key there).

Dec 31 '24 06:12 Seth-Peters

Would love more feedback on this specific issue. I'm currently running into the same problem. It looks like the page separators in the LlamaParse result (i.e. "---") are being processed in a way that cuts off the rest of the document.
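One way to check this hypothesis offline is to take the raw LlamaParse output and see what survives if only the text before the first separator is kept. The sketch below is an illustration, not Unstract's code: the helper names are hypothetical, and it assumes pages are joined by a line containing only "---".

```python
import re

# Assumption: LlamaParse joins pages with a line that contains only "---".
PAGE_SEPARATOR = re.compile(r"^[ \t]*---[ \t]*$", re.MULTILINE)


def split_pages(parsed_text: str) -> list[str]:
    """Split parser output into pages on separator lines, dropping empty pages."""
    pages = PAGE_SEPARATOR.split(parsed_text)
    return [page.strip() for page in pages if page.strip()]


def first_page_only(parsed_text: str) -> str:
    """Mimic the suspected behavior: keep only the text before the first separator."""
    pages = split_pages(parsed_text)
    return pages[0] if pages else ""


# Tiny stand-in for real parser output.
sample = "Title Page\n---\nChapter 1 text\n---\nChapter 2 text"
print(len(split_pages(sample)), "pages found")
print("kept if truncated at first separator:", repr(first_page_only(sample)))
```

If the text kept by `first_page_only` matches the single node stored in Qdrant, that would support the separator theory. Note that a Markdown horizontal rule inside a page would also match this pattern, so the page count may be overestimated.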

Jul 10 '25 15:07 ghopkins-lurin