id_hash_keys is ignored
Describe the bug
Passing id_hash_keys to an ingestion pipeline or individual nodes has no effect. For example given id_hash_keys = ["meta"] and two documents with different meta data, both will share the same id if the content is the same.
Error message No error
Expected behavior
Generated document ids should be formed using a hash of the fields passed through the id_hash_keys parameter
Additional context
During debugging I've noticed that the Document constructor ignores the id_hash_keys parameter that is passed by the node e.g. a TextConverter node. This is possibly due to the issue with mutable default parameters in Python.
For example try this code:
# Different metadata
doc1 = Document(content="hello world", meta={'doc_id': '1'}, id_hash_keys=['meta'])
print(doc1.id)
doc2 = Document(content="hello world", meta={'doc_id': '2'}, id_hash_keys=['meta'])
print(doc2.id)
Output:
# Same generated ids
ab97467d60eb63b1533f6046eb7f610e
ab97467d60eb63b1533f6046eb7f610e
However if we explicitly pass all constructor arguments:
doc3 = Document("hello world", "text", None, None, {'doc_id': '3'}, None, ['meta'])
print(doc3.id)
doc4 = Document("hello world", "text", None, None, {'doc_id': '4'}, None, ['meta'])
print(doc4.id)
Output:
# Different generated ids
ee6af6c5a7999fae9b46a6c257fd957e
324d51b0738f74b683f7d69cbf2a58ef
Different ids are generated
To Reproduce See above or try this code
def ingest(meta_id):
doc_store = InMemoryDocumentStore()
# Pass id_hash_keys to the TextConverter
text_converter = TextConverter(valid_languages=['en'], progress_bar=False, id_hash_keys=["meta", "content"])
# Pass id_hash_keys to the PreProcessor
pre_processor = PreProcessor(progress_bar=False, id_hash_keys=["meta", "content"])
p = Pipeline()
p.add_node(component=text_converter, name="TextConverter", inputs=["File"])
p.add_node(component=pre_processor, name="PreProcessor", inputs=["TextConverter"])
p.add_node(component=doc_store, name="DocumentStore", inputs=["PreProcessor"])
# Pass id_hash_keys to the pipeline
params = {
"debug": True,
"id_hash_keys": ["meta", "content"]
}
meta = {
"meta_id": meta_id
}
result = p.run(file_paths=["./test_doc.txt"], meta=meta, params=params)
# Final generated id
doc = result['documents'][0]
# TextConverter generated id
debug = result["_debug"]['TextConverter']
text_converter_id = debug['output']['documents'][0].id
print("Meta id".ljust(20), meta_id)
print("TextConverter id".ljust(20), text_converter_id)
print("Generated id".ljust(20), doc.id)
if __name__ == "__main__":
ingest("1")
print("")
ingest("2")
Output:
Meta id 1
TextConverter id 3eddb872a053055e226cb754159e098e
Generated id 3eddb872a053055e226cb754159e098e
Meta id 2
TextConverter id 3eddb872a053055e226cb754159e098e
Generated id 3eddb872a053055e226cb754159e098e
FAQ Check
- [x] Have you had a look at our new FAQ page?
System:
- OS: macOS Monterey
- GPU/CPU: CPU
- Haystack version (commit or version number): 1.8.0
- DocumentStore: N/A
- Reader: N/A
- Retriever: N/A
A bit more digging reveals this code in the PreProcessor split function (line 431 in preprocessor.py):
documents = []
for i, txt in enumerate(text_splits):
# I think this is the problem
doc = Document(content=txt, meta=deepcopy(document.meta) or {}, id_hash_keys=id_hash_keys)
doc.meta["_split_id"] = I
if self.add_page_number:
doc.meta["page"] = splits_pages[I]
documents.append(doc)
return documents
The id_hash_keys argument is lost
I made some experiments with Haystack 1.8.0 on Ubuntu 22.04 and Colab (=Ubuntu 18.04). Python version: 3.7 You can find my notebook here.
Test on document instantiation:
doc1 = Document(content="hello world", meta={'doc_id': '1'}, id_hash_keys=['meta'])
print(doc1.id)
# 401d5da16fd3fc8a37752172a883f2dc
doc2 = Document(content="hello world", meta={'doc_id': '2'}, id_hash_keys=['meta'])
print(doc2.id)
# 347767162caba0d0869dcb013bcb572a
Ingestion test:
Meta id 1
TextConverter id 70f712e8a8c3d3017f11ccc5db02a8ea
Generated id 729fa6a177edd9f3ba9a55589bbda110
Meta id 2
TextConverter id c2de7093742566ddff0398d8b113ac25
Generated id 866818724f2411a7f7c5680ad44870ec
@thobson My tests show different results than yours. Are we sure we are using the same version? Any other ideas?
@anakin87 I've checked the haystack version and I'm definitely on 1.8.0. I also tried the same code on Colab and got the same results as you. I'm thinking this may be related to the Python version. My broken example used 3.9.9 on MacOS. I'll try some other versions and report back ...
This is very strange. The only place this is working for me is collab. I just fired up an ec2 instance with Python 3.10.4 and farm-haystack==1.8.0.
# bug.py
from haystack.schema import Document
doc1 = Document(content="hello world", meta={'doc_id': '1'}, id_hash_keys=['meta'])
print(doc1.id)
doc2 = Document(content="hello world", meta={'doc_id': '2'}, id_hash_keys=['meta'])
print(doc2.id)
$ python3 --version
Python 3.10.4
$ pip freeze | grep haystack
farm-haystack==1.8.0
$ python3 bug.py
ab97467d60eb63b1533f6046eb7f610e
ab97467d60eb63b1533f6046eb7f610e
I'll keep investigating
Confirmed on my side
(tmp-4d563e993a01222) ~/.virtualenvs/tmp-4d563e993a01222 sw_vers
ProductName: macOS
ProductVersion: 12.5.1
BuildVersion: 21G83
(tmp-4d563e993a01222) ~/.virtualenvs/tmp-4d563e993a01222 python -V
Python 3.10.6
(tmp-4d563e993a01222) ~/.virtualenvs/tmp-4d563e993a01222 pip freeze | grep haystack
farm-haystack==1.8.0
(tmp-4d563e993a01222) ~/.virtualenvs/tmp-4d563e993a01222 python bug.py
ab97467d60eb63b1533f6046eb7f610e
ab97467d60eb63b1533f6046eb7f610e
I tried to reproduce this issue with different Haystack versions and python version. The problem did occur with Haystack v1.8.0 but not with any version after that.
| Python/Haystack Version | 1.8.0 | 1.9.0 | 1.10.0 |
|---|---|---|---|
| 3.10 | ❌ | ✅ | ✅ |
| 3.9 | ❌ | ✅ | ✅ |
| 3.8 | ❌ | ✅ | ✅ |
For that reason, I only added a test case to check for this issue in future in this PR https://github.com/deepset-ai/haystack/pull/3577 and suggest to close this issue once that PR is merged. Please let me know if you have this issue with Haystack >v1.8.0. Thank you!
Confirmed this is working in 1.10.0. Thanks everyone!