RecursiveDocumentSplitter updates Document's meta field after initializing it

Open julian-risch opened this issue 6 months ago • 1 comments

Describe the bug In https://github.com/deepset-ai/haystack/blob/a28b2851d9251ad2275d344ba46d1bb8fb35932e/haystack/components/preprocessors/recursive_splitter.py#L426

Documents with the same content (and same initial meta data) will be assigned the same id in the RecursiveDocumentSplitter. As a result, the run method of the RecursiveDocumentSplitter might return documents with the same id. That looks like a bug to me too.

A possible fix is to build the new meta data first, including the assignment currently written as new_doc.meta["split_id"] = split_nr, and only afterward create the new document. In addition, we should add the id of the parent document. I have in mind something like:

meta = deepcopy(doc.meta)
meta["parent_id"] = doc.id
meta["split_id"] = split_nr
meta["split_idx_start"] = current_position
meta["_split_overlap"] = [] if self.split_overlap > 0 else None
new_doc = Document(content=chunk, meta=meta)
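To illustrate why the order matters, here is a minimal sketch using only the standard library. `SketchDocument` is a hypothetical stand-in, not Haystack's `Document`; it only assumes that the id is derived from content and meta once, at creation time, which is the behavior this issue describes.

```python
import hashlib
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchDocument:
    """Hypothetical stand-in: the id is derived once, when the object is created."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # Hash content and meta as they are right now; later meta changes are not seen.
        if not self.id:
            payload = f"{self.content}{sorted(self.meta.items())}"
            self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


parent_meta = {"source": "example.txt"}
chunks = ["same text", "same text"]

# Order described in the bug: create the document, then update meta.
late = []
for split_nr, chunk in enumerate(chunks):
    doc = SketchDocument(content=chunk, meta=deepcopy(parent_meta))
    doc.meta["split_id"] = split_nr  # too late, the id is already fixed
    late.append(doc)

# Proposed order: build the complete meta, then create the document.
early = []
for split_nr, chunk in enumerate(chunks):
    meta = deepcopy(parent_meta)
    meta["split_id"] = split_nr
    early.append(SketchDocument(content=chunk, meta=meta))

print(late[0].id == late[1].id)    # True: both chunks share one id
print(early[0].id == early[1].id)  # False: split_id now feeds into the id
```

In the sketch, the only difference between the two loops is whether `split_id` is set before or after construction, which is enough to make the ids collide or differ.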

Error message None. Documents with the same id might be handled as duplicates later in a pipeline.

Expected behavior Different chunks with same content and differing meta data should have different document ids.
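A regression test could check this expectation directly on the splitter's output. The helper below is a sketch; `find_duplicate_ids` is a hypothetical name, and it only assumes that each returned object exposes an `id` attribute.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable, List


def find_duplicate_ids(documents: Iterable[Any]) -> List[str]:
    """Return every id that occurs more than once, in first-seen order."""
    counts = Counter(doc.id for doc in documents)
    return [doc_id for doc_id, n in counts.items() if n > 1]


# Stand-in objects; in a real test these would be the documents returned by run().
docs = [SimpleNamespace(id="a"), SimpleNamespace(id="b"), SimpleNamespace(id="a")]
print(find_duplicate_ids(docs))  # ['a']
```

An empty result means every chunk received a distinct id.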

julian-risch avatar Jun 12 '25 07:06 julian-risch

I am working on this issue

gulbaki avatar Jun 14 '25 12:06 gulbaki