[Bug]: Track id is generated for duplicate documents
Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [ ] I believe this is a legitimate bug, not just a question or feature request.
Describe the bug
The document/text API returns a track ID for duplicated documents. If I then call /documents/track_status/{track_id} with that track ID, I get nothing back, so the status cannot be tracked with this ID.
Steps to reproduce
No response
Expected Behavior
No response
LightRAG Config Used
Paste your config here
Logs and screenshots
No response
Additional Information
- LightRAG Version:
- Operating System:
- Python Version:
- Related Issues:
Please provide the endpoint(s) where the invalid tracking ID was encountered.
GET /documents/track_status/{track_id}, passing the track ID in the request.
From which endpoint was the tracking ID retrieved? What does "track id for duplicated documents" mean, and under what circumstances was the tracking ID obtained?
Scenario:
First call
- I send a POST request to /document/text with some text and a unique document name.
- The API returns a track_id, which I then pass to GET /documents/track_status/{track_id}.
- This returns the correct status for the document.
Second call (duplicate text)
- I send the same text again, but with a new unique document name.
- The API again returns a track_id for this request.
- However, when I pass this new track_id to GET /documents/track_status/{track_id}, the response is empty.
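The two calls in this scenario can be scripted as a minimal repro sketch. It assumes a `requests.Session`-like object, a JSON body with `text` and `file_source` fields, and a `track_id` key in the response; those field names are assumptions and are not confirmed anywhere in this thread. The endpoint paths are copied as written in this report.

```python
import uuid

# Paths as written in this report; adjust if your server exposes different routes.
TEXT_ENDPOINT = "/document/text"
STATUS_ENDPOINT = "/documents/track_status/{track_id}"


def submit_text(session, base_url, text):
    """POST the text under a fresh, unique source name and return the track_id."""
    payload = {
        "text": text,
        # Assumed field name for the unique document name.
        "file_source": f"repro-{uuid.uuid4().hex}.txt",
    }
    resp = session.post(base_url + TEXT_ENDPOINT, json=payload)
    resp.raise_for_status()
    # Assumed response key.
    return resp.json().get("track_id")


def fetch_status(session, base_url, track_id):
    """GET the tracking status for one track_id and return the parsed JSON."""
    resp = session.get(base_url + STATUS_ENDPOINT.format(track_id=track_id))
    resp.raise_for_status()
    return resp.json()


def reproduce(session, base_url, text):
    """Submit the same text twice under different names and collect both statuses."""
    first_id = submit_text(session, base_url, text)
    first_status = fetch_status(session, base_url, first_id)
    second_id = submit_text(session, base_url, text)
    second_status = fetch_status(session, base_url, second_id)
    return {
        "first": (first_id, first_status),
        "second": (second_id, second_status),
    }
```

With the `requests` package installed, passing `requests.Session()` and the server's base URL to `reproduce` should show whether the second status comes back empty.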
If a file is duplicated, the current implementation returns an empty track_id instead of the track_id from the existing file. Would it be more reasonable to return the track_id of the existing file?
It should be the case. However, if the text is exactly the same but the file source is different, I am getting separate IDs from both requests: the first one is traceable, but the second one is not.
The screenshot below is for the second track ID.
PR #2469 addressed this issue. Please pull the latest code and check if it works as expected.
@danielaskdd - I tested with the updated code and it does return a tracking ID, but the tracking ID doesn't contain any useful information.
Flow:
- Upload Document A -> get tracking_id_A, status: 'success'
- Upload Document B (same text content, but a different file that has embedded audio) -> get tracking_id_B, status: 'success'

Document A is moved to the data storage __enqueued__ folder.
Document B is also moved to the data storage __enqueued__ folder.
tracking_id_B is not the same as tracking_id_A. When used, it returns a response that looks exactly like the image @tayyabatabassum-dev last uploaded.

- Reupload Document A -> get tracking_id_A (same as first upload), status: 'duplicated'
- Reupload Document B -> get tracking_id_B_prime, which doesn't match any existing tracking ID, status: 'success'

Document A is not moved to storage again, and its 'duplicated' status seems fine.
Document B, however, is copied to storage again and given the _001 suffix.
The issue I think @tayyabatabassum-dev and I are facing is that we want Document B to get a 'duplicated' response, possibly with the doc_id or the tracking_id of the document it duplicates. For my part, I'd log that it's a duplicate, along with the ID of the original, so I can check the docs myself. Right now, it looks like Document B gets uploaded and its tracking_id then points to nothing.
from the logger:
WARNING: Ignoring document ID (already exists): doc-54aa5a92238da2eab6e3429dafdff91b (Document B Filename)
INFO: Successfully extracted and enqueued file: Document B Filename
DEBUG: Moved file to enqueued directory: Document B Filename -> Document B Filename
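The `doc-` prefix followed by 32 hex characters in the warning above looks like a hash of the text content, which would explain why the same text under a different filename collides with the existing document. Below is a minimal in-memory sketch of the behavior requested in this comment: a duplicate upload reports 'duplicated' and points back at the original's track ID. It assumes the document ID is an MD5 digest of the content. `TrackRegistry`, `enqueue`, and the returned field names are hypothetical and are not LightRAG's actual API.

```python
import hashlib
import uuid


def content_doc_id(text: str) -> str:
    # Assumption: the ID is "doc-" plus the MD5 hex digest of the text content,
    # so the filename plays no part in it.
    return "doc-" + hashlib.md5(text.encode("utf-8")).hexdigest()


class TrackRegistry:
    """Hypothetical in-memory map from content-derived doc_id to track_id."""

    def __init__(self):
        self._track_by_doc = {}

    def enqueue(self, text: str, file_source: str) -> dict:
        doc_id = content_doc_id(text)
        existing = self._track_by_doc.get(doc_id)
        if existing is not None:
            # Duplicate content: reuse the original track_id so the caller
            # can still look up a meaningful status.
            return {
                "status": "duplicated",
                "track_id": existing,
                "doc_id": doc_id,
                "file_source": file_source,
            }
        # New content: issue a fresh track_id and remember it.
        track_id = f"track-{uuid.uuid4().hex}"
        self._track_by_doc[doc_id] = track_id
        return {
            "status": "success",
            "track_id": track_id,
            "doc_id": doc_id,
            "file_source": file_source,
        }


registry = TrackRegistry()
doc_a = registry.enqueue("same text", "Document A")
doc_b = registry.enqueue("same text", "Document B")
print(doc_b["status"], doc_b["track_id"] == doc_a["track_id"])
```

In this sketch Document B is flagged as a duplicate of Document A and carries A's track ID, so the status lookup would never point at nothing.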
@tayyabatabassum-dev Please pull the latest code and check if it works as expected.