ragflow [Bug]: Chunk Data Not Updated Immediately After API Call

When updating a chunk via the API and then immediately calling list_chunks, the retrieved chunk data reflects the state before the update.

doc.list_chunks()[0].to_json()
doc.list_chunks()[0].update({'content':'new content'})
doc.list_chunks()[0].to_json()
# not new content

May 22 '25 02:05 mumumi

The behavior you're seeing, where updating a chunk via the API and then immediately calling list_chunks does not reflect the updated data, is a known characteristic of the current implementation.

Here's what is happening:

The chunk update is performed via a PUT request to the API, which updates the chunk data on the server side. This is handled by the update_chunk endpoint and related backend logic, which updates the chunk in the document store and processes the new content and metadata as needed (source, source).
The list_chunk endpoint retrieves chunk data using a search function. However, there is evidence from the test suite that updates may not be reflected immediately. Specifically, tests use a sleep function to wait for updates to propagate before asserting that the new data is visible (source).

This suggests that there may be asynchronous processing or caching involved, causing a delay between when the update is made and when it is visible via list_chunks.

Workaround:
To ensure you see the updated chunk data, you may need to introduce a short delay (e.g., using time.sleep in Python) after the update call before retrieving the chunk data again. This is consistent with the approach used in the project's own tests.

If you need to be sure you are reading the updated data, a fixed delay is not a guarantee. Implement a retry mechanism that polls for the updated content until it appears (with a timeout) instead of only waiting for a short period before fetching.
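
A minimal sketch of such a retry loop is shown below. The helper name wait_for_chunk_content and its parameters are hypothetical and not part of the RAGFlow SDK; it polls any zero-argument callable until the expected value appears or a timeout expires.

```python
import time


def wait_for_chunk_content(fetch_content, expected, timeout=10.0, interval=0.5):
    """Poll fetch_content() until it returns `expected` or the timeout expires.

    fetch_content is any zero-argument callable (hypothetical helper, not an
    SDK function). Returns True on success, False if the timeout is hit first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if fetch_content() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible usage with the SDK objects from the report. This assumes chunks
# expose `id` and `content` attributes, which is not confirmed in this thread:
#
#   chunk = doc.list_chunks()[0]
#   chunk.update({"content": "new content"})
#   ok = wait_for_chunk_content(
#       lambda: next(c.content for c in doc.list_chunks() if c.id == chunk.id),
#       "new content",
#   )
```

Matching on the chunk id rather than on position [0] avoids depending on the order in which list_chunks returns results, which may not be stable across calls.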

I did not find explicit documentation or code comments explaining the exact cause of the delay (such as a specific caching layer), but the use of sleep in tests strongly indicates that this is expected behavior in the current system.

If you need more details about the backend processing or want to discuss possible improvements, let me know!

May 22 '25 02:05 dosubot[bot]

As far as I know, the default kb_conn (ES) does not reflect a write immediately; the change only becomes searchable after an index refresh: https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-refresh-1

May 23 '25 01:05 Woody-Hu

@mumumi As far as I know, this is an asynchronous operation. You'll need to check the status manually using the list_chunks method.

May 26 '25 06:05 asiroliu

I know that “async_parse” and “async_cancel_parse” are asynchronous operations. It’s counterintuitive that the “update” operation is asynchronous. Is the “update” operation for the dataset also asynchronous? Are there any other operations that are asynchronous?

May 27 '25 03:05 mumumi

Yes, an "update" written to ES may not show fresh data immediately, because searches only see it after the next index refresh: https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-refresh-1
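
To illustrate the refresh behavior, here is a toy in-memory model (not Elasticsearch or RAGFlow code; the ToyIndex class is purely hypothetical). Writes are acknowledged at once but become visible to search only after a refresh, which Elasticsearch performs periodically in the background.

```python
class ToyIndex:
    """Toy model of a near-real-time search index (not Elasticsearch itself)."""

    def __init__(self):
        self._pending = {}     # writes accepted but not yet searchable
        self._searchable = {}  # what search() can currently see

    def update(self, doc_id, content):
        # The write is acknowledged right away but only buffered.
        self._pending[doc_id] = content

    def refresh(self):
        # A refresh makes all buffered writes visible to search.
        self._searchable.update(self._pending)
        self._pending.clear()

    def search(self, doc_id):
        return self._searchable.get(doc_id)


index = ToyIndex()
index.update("chunk-1", "old content")
index.refresh()

index.update("chunk-1", "new content")
before = index.search("chunk-1")  # search still sees the old content
index.refresh()
after = index.search("chunk-1")   # the update is visible after the refresh
```

This is why a read issued right after the update can return the old chunk. Elasticsearch write APIs accept a refresh parameter (for example refresh=wait_for) that makes the call wait until the change is searchable, but whether RAGFlow exposes that option for chunk updates is not established in this thread.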

May 27 '25 05:05 Woody-Hu