Handle Document ids consistently and enable custom `id_hash_keys`

Open julian-risch opened this issue 6 months ago • 0 comments

We discussed that ids are not handled consistently in Haystack when a component updates metadata, for example `LLMMetadataExtractor`. We discussed this PR and its implications with @tstadel @sjrl @ju-gu.

We agreed that components that don't change the content should not generate a new id, with the exception of `DocumentCleaner`, which has a `keep_id` parameter with the default value `false`. In other words, if a component only updates the metadata of documents, the document ids should remain unchanged in the output.
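To illustrate the agreed behavior, here is a minimal sketch that does not use Haystack itself. `Doc` and `add_metadata` are hypothetical stand-ins (not Haystack classes or functions), assuming a document is just an id, content, and a metadata dict. The point is only that a metadata-only update copies the id instead of recomputing it.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a document; not the Haystack Document class."""

    id: str
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def add_metadata(doc: Doc, extra: Dict[str, Any]) -> Doc:
    """Return a copy of `doc` with `extra` merged into its metadata.

    The content is untouched, so the id is carried over unchanged
    rather than being regenerated.
    """
    return replace(doc, meta={**doc.meta, **extra})


original = Doc(id="abc123", content="Hello world", meta={"source": "a.txt"})
updated = add_metadata(original, {"language": "en"})
```

In this sketch `updated.id` equals `original.id`, while `updated.meta` contains both the old and the new keys.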

To allow more customization in how ids are generated for newly initialized documents, we agreed that there are three options:

  • Adding id_hash_keys to all converters
  • Adding a new component just before the DocumentWriter
  • Adding a new parameter to the DocumentWriter that enables generating new document ids based on id_hash_keys

The third option is preferred. In addition, we discussed that we should not use the embedding field for document id generation, but we currently use it here: https://github.com/deepset-ai/haystack/blob/main/haystack/dataclasses/document.py#L117
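As a rough sketch of what id generation based on `id_hash_keys` could look like, consider the following. This is not the current Haystack implementation or a proposed API: `generate_id`, the `meta.` prefix convention for addressing metadata fields, and the choice of SHA-256 over a JSON serialization are all assumptions made for illustration. The sketch also rejects `embedding` as a hash key, in line with the point above.

```python
import hashlib
import json
from typing import Any, Dict, List


def generate_id(fields: Dict[str, Any], id_hash_keys: List[str]) -> str:
    """Hypothetical helper: derive a document id from the selected fields only.

    Keys starting with "meta." (an assumed convention) are looked up
    inside the "meta" dict; all other keys are top-level fields.
    """
    if "embedding" in id_hash_keys:
        raise ValueError("'embedding' must not be used for id generation")

    payload: Dict[str, Any] = {}
    for key in id_hash_keys:
        if key.startswith("meta."):
            payload[key] = fields.get("meta", {}).get(key[len("meta."):])
        else:
            payload[key] = fields.get(key)

    # sort_keys makes the serialization independent of key order.
    serialized = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


doc_a = {"content": "Hello", "meta": {"source": "a.txt"}, "embedding": [0.1, 0.2]}
doc_b = {"content": "Hello", "meta": {"source": "b.txt"}, "embedding": [0.9, 0.8]}

id_by_content_a = generate_id(doc_a, ["content"])
id_by_content_b = generate_id(doc_b, ["content"])
id_by_source_a = generate_id(doc_a, ["content", "meta.source"])
id_by_source_b = generate_id(doc_b, ["content", "meta.source"])
```

With `["content"]` as hash keys, both documents get the same id even though their embeddings differ. Adding `"meta.source"` to the keys makes the ids differ.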

julian-risch, Jun 27 '25 10:06