RecursiveDocumentSplitter updates Document's meta field after initializing it
**Describe the bug**
In https://github.com/deepset-ai/haystack/blob/a28b2851d9251ad2275d344ba46d1bb8fb35932e/haystack/components/preprocessors/recursive_splitter.py#L426 the chunk `Document` is created first and its `meta` (`split_id`, `split_idx_start`, `_split_overlap`) is only filled in afterwards. Because the document id is hashed from content and meta at initialization time, chunks with the same content (and the same initial metadata) are assigned the same id, so the `run` method of the `RecursiveDocumentSplitter` can return documents with duplicate ids. This looks like a bug to me.
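The root cause can be seen with `Document` alone; a minimal sketch (the content and meta values are made up):

```python
from haystack import Document

# The id is hashed from content and meta when the Document is constructed.
doc_a = Document(content="same chunk text", meta={"source": "report.pdf"})
doc_b = Document(content="same chunk text", meta={"source": "report.pdf"})
assert doc_a.id == doc_b.id  # identical content + identical initial meta -> identical id

# Mutating meta after construction (which is what the splitter currently does
# with split_id etc.) does not recompute the id, so the collision persists.
doc_b.meta["split_id"] = 1
assert doc_a.id == doc_b.id
```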
A possible fix is to build the new metadata first (including the fields that are currently set after construction, such as `new_doc.meta["split_id"] = split_nr`) and only afterwards create the new document, so that these fields are part of the id. In addition, we should add the id of the parent document. I have something like the following in mind:
```python
# Build the complete meta first, so that every field (including parent_id,
# split_id and split_idx_start) is part of the id hash.
meta = deepcopy(doc.meta)
meta["parent_id"] = doc.id
meta["split_id"] = split_nr
meta["split_idx_start"] = current_position
meta["_split_overlap"] = [] if self.split_overlap > 0 else None
new_doc = Document(content=chunk, meta=meta)
```
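Creating the `Document` only once the meta dictionary is complete means the split-specific fields feed into the id hash, so chunks with identical content still receive distinct ids; the `deepcopy` also keeps nested meta values from being shared between the chunks and the parent document.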
**Error message**
None. However, documents with the same id may be handled as duplicates later in a pipeline, for example when they are written to a DocumentStore.
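As an illustration of the downstream effect (a sketch assuming the `InMemoryDocumentStore` and `DuplicatePolicy` from Haystack 2.x, not taken from the splitter itself):

```python
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

# Two chunks with colliding ids, as the splitter can currently produce them.
chunks = [Document(content="same chunk text"), Document(content="same chunk text")]

store = InMemoryDocumentStore()
store.write_documents(chunks, policy=DuplicatePolicy.OVERWRITE)
print(store.count_documents())  # 1 -- the second chunk silently overwrote the first
```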
**Expected behavior**
Different chunks with the same content but differing metadata should have different document ids.
**Additional context**
**To Reproduce**
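A minimal reproduction sketch (assuming the Haystack 2.x API; parameter names and the need for a `warm_up()` call may differ between versions):

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

# Repeating the same paragraph makes several chunks share identical content,
# which is enough to trigger the id collision.
text = "This is a repeated paragraph.\n\n" * 4
splitter = RecursiveDocumentSplitter(split_length=10, split_overlap=0, separators=["\n\n"])
# Depending on the Haystack version, splitter.warm_up() may have to be called here.

chunks = splitter.run(documents=[Document(content=text)])["documents"]
ids = [chunk.id for chunk in chunks]
print(len(chunks), len(set(ids)))  # expected: equal; actual: fewer unique ids than chunks
```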
**FAQ Check**
- [ ] Have you had a look at our new FAQ page?
**System:**
- OS:
- GPU/CPU:
- Haystack version (commit or version number):
- DocumentStore:
- Reader:
- Retriever:
I am working on this issue.