
[Bug]: Track id is generated for duplicate documents

Open tayyabatabassum-dev opened this issue 1 month ago • 10 comments

Do you need to file an issue?

  • [x] I have searched the existing issues and this bug is not already filed.
  • [ ] I believe this is a legitimate bug, not just a question or feature request.

Describe the bug

The document/text API returns a track ID for duplicated documents. When I call GET /documents/track_status/{track_id} with that track ID, I get nothing back, so the status cannot be tracked with this ID.

Steps to reproduce

No response

Expected Behavior

No response

LightRAG Config Used

Paste your config here

Logs and screenshots

No response

Additional Information

  • LightRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:

tayyabatabassum-dev avatar Dec 02 '25 05:12 tayyabatabassum-dev

Please provide the endpoint(s) where the invalid tracking ID was encountered.

danielaskdd avatar Dec 02 '25 05:12 danielaskdd

GET : /documents/track_status/{track_id} passing track id in the api.

tayyabatabassum-dev avatar Dec 02 '25 05:12 tayyabatabassum-dev

From which endpoint was the tracking ID retrieved? What does "track id for duplicated documents" mean? Under what circumstances was the tracking ID obtained?

danielaskdd avatar Dec 02 '25 05:12 danielaskdd

Scenario:

First call

  • I send a POST request to /document/text with some text and a unique document name.

  • The API returns a track_id, which I then pass to GET /documents/track_status/{track_id}.

  • This returns the correct status for the document.

Second call (duplicate text)

  • I send the same text again, but with a new unique document name.

  • The API again returns a track_id for this request.

  • However, when I pass this new track_id to GET /documents/track_status/{track_id}, the response is empty.
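
For anyone who wants to reproduce the two calls above, here is a minimal sketch using only the Python standard library. Several things in it are assumptions rather than confirmed details: the base URL and port, the JSON field names `text` and `file_source`, and the presence of a `track_id` key in the insert response. The paths are copied as written in this thread, so adjust them if your server exposes different ones.

```python
import json
import urllib.request

# Assumption: local server address; change to match your deployment.
BASE_URL = "http://localhost:9621"
# Paths as written in this thread; adjust if your server differs.
INSERT_PATH = "/document/text"
TRACK_PATH = "/documents/track_status/{track_id}"


def build_insert_request(text, file_source):
    """Build (but do not send) the POST request that inserts a text document."""
    # Assumption: the endpoint accepts a JSON body with these two fields.
    body = json.dumps({"text": text, "file_source": file_source}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + INSERT_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def track_status_url(track_id):
    """Return the URL used to look up the status for a track ID."""
    return BASE_URL + TRACK_PATH.format(track_id=track_id)


def call(request_or_url):
    """Send a request (or GET a URL) and decode the JSON response."""
    with urllib.request.urlopen(request_or_url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def reproduce(text="identical text for both calls"):
    """Insert the same text twice under different names and fetch each status."""
    results = []
    for name in ("first_name.txt", "second_name.txt"):
        inserted = call(build_insert_request(text, name))
        # Assumption: the insert response carries the ID under "track_id".
        track_id = inserted.get("track_id")
        results.append((name, track_id, call(track_status_url(track_id))))
    return results
```

Calling `reproduce()` against a running server should return two tuples; with the behavior described above, the status payload in the second tuple would come back empty. Processing is asynchronous, so the first status may still be pending when it is fetched.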

tayyabatabassum-dev avatar Dec 02 '25 06:12 tayyabatabassum-dev

If a file is duplicated, the current implementation returns an empty track_id instead of the track_id from the existing file. Would it be more reasonable to return the track_id of the existing file?

danielaskdd avatar Dec 02 '25 07:12 danielaskdd

It should be the case. However, if the text is exactly the same but the file source is different, I get separate IDs from the two requests: the first one is traceable, but the second one is not. The screenshot below is for the second track ID. (Image)

tayyabatabassum-dev avatar Dec 02 '25 07:12 tayyabatabassum-dev

PR #2469 addressed this issue. Pls pull the latest code and check if it works as expected.

danielaskdd avatar Dec 02 '25 09:12 danielaskdd

@danielaskdd - I tested with the updated code and it does return a tracking ID, but the tracking ID doesn't contain any useful information.

Flow:

  • Upload Document A -> get tracking_id_A, status: 'success'
  • Upload Document B (same text content, but a different file with embedded audio) -> get tracking_id_B, status: 'success'

Document A is moved to the data storage enqueued folder. Document B is also moved to the data storage __enqueued_folder. tracking_id_B is not the same as tracking_id_A; when used, it returns a response that looks exactly like the last image @tayyabatabassum-dev uploaded.

  • Reupload Document A -> get tracking_id_A (same as first upload), status: 'duplicated'
  • Reupload Document B -> get tracking_id_B_prime, which doesn't match any existing tracking ID, status: 'success'

Document A is not moved to storage again, and the 'duplicated' status seems fine. Document B, however, is copied to storage again and given the _001 suffix.

The issue I think @tayyabatabassum-dev and I are facing is that we want Document B to get a 'duplicated' response, possibly with the doc_id or the tracking_id of the document it duplicates. For my part, I'd log that it's a duplicate, along with the ID of the original, so I can check the docs myself. Right now, though, it looks like Document B gets uploaded, and then the tracking_id just points to nothing.
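
To make the requested behavior concrete, here is a small in-memory sketch, not LightRAG's actual implementation. It assumes the document ID is an MD5 hash of the text content with a `doc-` prefix (a guess based on the 32-hex-digit `doc-` ID format in this thread), and the `DedupRegistry` and `InsertResult` names and the `track-` prefix are hypothetical.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass
class InsertResult:
    status: str    # "success" for new content, "duplicated" otherwise
    track_id: str  # track ID of the original insert when duplicated
    doc_id: str    # content-derived ID, identical for identical text


class DedupRegistry:
    """Hypothetical registry mapping content-derived doc IDs to track IDs."""

    def __init__(self):
        self._track_by_doc = {}

    @staticmethod
    def doc_id_for(text):
        # Assumption: the ID depends only on the text, not on the file name.
        return "doc-" + hashlib.md5(text.encode("utf-8")).hexdigest()

    def insert(self, text, file_source):
        # file_source is deliberately ignored for deduplication, so the same
        # text under a new name still counts as a duplicate.
        doc_id = self.doc_id_for(text)
        existing = self._track_by_doc.get(doc_id)
        if existing is not None:
            # Report the duplicate and point back at the original's track ID.
            return InsertResult("duplicated", existing, doc_id)
        track_id = "track-" + uuid.uuid4().hex
        self._track_by_doc[doc_id] = track_id
        return InsertResult("success", track_id, doc_id)
```

With this shape, uploading Document B after Document A would yield status 'duplicated' plus Document A's track ID and doc ID, which the caller can log or use to look up the original.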

Justinius avatar Dec 08 '25 18:12 Justinius

from the logger:

WARNING: Ignoring document ID (already exists): doc-54aa5a92238da2eab6e3429dafdff91b (Document B Filename)
INFO: Successfully extracted and enqueued file: Document B Filename
DEBUG: Moved file to enqueued directory: Document B Filename -> Document B Filename

Justinius avatar Dec 08 '25 19:12 Justinius

@tayyabatabassum-dev Pls pull the latest code and check if it works as expected.

danielaskdd avatar Dec 12 '25 03:12 danielaskdd