haystack
haystack copied to clipboard
Refactoring the `Raypipeline.run` method - merging it with the `Pipeline.run`
This is to fix #2967
Related Issues
- fixes #2967
Proposed Changes:
There is a bug (see #2967), which makes the RayPipeline
fail for more complicated pipelines, like the BM25Retriever + EmbeddingRetriever + Reader pipeline.
To fix this - after a discussion with @ZanSara (see https://app.slack.com/client/T01J6PLALBV/C01J3TZM9HT), we agreed that we should refactor the RayPipeline.run
by basically removing it and making the Pipeline.run
to be able to work with calls from the RayPipeline
. This is what this PR entails.
How did you test it?
I have added a new unit test for the scenario when there are more than one Retrievers in front of the Reader in the RayPipeline. I have also tested whether this fixes #2967, and it does.
Notes for the reviewer
N/A
Checklist
- [x] I have read the contributors guidelines and the code of conduct
- [x] I have updated the related issue with new insights and changes
- [x] I added tests that demonstrate the correct behavior of the change
- [x] I've used the conventional commit convention for my PR title
- [x] I documented my code
- [x] I ran pre-commit hooks and fixed any issue
Hello! Thank you for this PR! By reading the description I have a couple of things to ask before going into the review.
First:
I could not find the place in the pipelines/base.py where the resulting documents of multiple retriever nodess are being joined together before being used as the input for the Reader node (eg documents=List[]), so I had to add that (see https://github.com/deepset-ai/haystack/compare/master...zoltan-fedor:haystack:feature-raypipeline-refactor-run?expand=1#diff-31b925f6ad8af5214cd1aeae3ff4f7afecf5e8a5e44a95cf458fd463fcae576fR520), including removing duplicate documents by content. If there is a method which already does this, then this part of the code might be a duplication, but I couldn't find where this might be...
So the situation with joining outputs is needlessly complicated. If you want to read a thread on the topic, I had a quite in-depth exchange with @MichelBartels a while ago exactly on this topic: https://github.com/deepset-ai/haystack/issues/2599 Have a look at the two related PRs for the actual discussion.
To summarize: on one end we have pipelines that perform document joining with no problem whatsoever:
Example 1
# This works
from pathlib import Path
from haystack import Pipeline
from haystack.nodes import FileTypeClassifier, TextConverter
from haystack.document_stores import InMemoryDocumentStore
pipe = Pipeline()
pipe.add_node(name="classifier", component=FileTypeClassifier(supported_types=["txt", "sh", "png", "yml"]), inputs=["File"])
pipe.add_node(name="txt-converter", component=TextConverter(), inputs=["classifier.output_1"])
pipe.add_node(name="sh-converter", component=TextConverter(), inputs=["classifier.output_2"])
pipe.add_node(name="yml-converter", component=TextConverter(), inputs=["classifier.output_4"])
pipe.add_node(name="docstore", component=InMemoryDocumentStore(), inputs=["txt-converter", "sh-converter", "yml-converter"])
pipe.run(file_paths=[Path("data/tutorial17/1_Dragonstone__Game_of_Thrones_episode_.txt")])
On the other, we have pipelines that can't work without JoinDocuments
:
Example 2
# This fails
from haystack import Pipeline
from haystack.nodes import BM25Retriever, FARMReader, JoinDocuments
from haystack.document_stores import ElasticsearchDocumentStore
docstore = ElasticsearchDocumentStore()
pipe = Pipeline()
pipe.add_node(name="bm25_1", component=BM25Retriever(document_store=docstore), inputs=["Query"])
pipe.add_node(name="bm25_2", component=BM25Retriever(document_store=docstore), inputs=["Query"])
# pipe.add_node(name="join", component=JoinDocuments(), inputs=["bm25_2", "bm25_1"]) # UNCOMMENT TO FIX
pipe.add_node(name="reader", component=FARMReader(model_name_or_path="deepset/roberta-base-squad2"), inputs=["join"])
results = pipe.run(query="Who's the father of Arya Stark?")
for answer in results["answers"]:
print(answer.answer)
# File "/home/sara/work/haystack/haystack/pipelines/base.py", line 509, in run
# node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(**node_input)
# File "/home/sara/work/haystack/haystack/nodes/base.py", line 201, in _dispatch_run
# return self._dispatch_run_general(self.run, **kwargs)
# File "/home/sara/work/haystack/haystack/nodes/base.py", line 245, in _dispatch_run_general
# output, stream = run_method(**run_inputs, **run_params)
# TypeError: run() missing 1 required positional argument: 'documents'
However, given that there is a way to make all pipeline works, by either using JoinDocuments
or not, I propose to not address this issue yet, and leave them for the soon-to-come pipeline refactoring we have planned for this year: https://github.com/deepset-ai/haystack/issues/2807 I think we should focus our efforts on fixing the bug in JoinDocuments
and (for now) leave the pipeline as it is.
Once you removed the changes addressing this mismatch let me know and I will review the PR. I'm sorry to ask this, but this whole situation is better addressed with a more radical refactoring, because there is a lot of convoluted stuff going on :sweat_smile:
Second: I'm not sure the PR and the issue linked are exactly the same. This PR is more like a refactoring of RayPipeline
, but does not address the async
support, right? We can say that this PR is ground work for that issue, but I would like to keep the original issue open until async
calls are actually supported.
Hi @ZanSara,
In the end I got to the same conclusion based on the current state of the Pipeline code - namely that some pipelines do require the JoinDocuments
node to join documents, while others don't - and that is just how it is for now.
Based on this and your comment I have now removed that unnecessary complication about trying to make it work for all pipelines without the use of JoinDocuments
. This way the PR is a lot simpler now.
Regarding your second comment - sorry for the confusion, but yes, this PR was not meant to address the not having async
issue. I have linked the wrong issue to the PR. Apologies.
I have fixed that now - it should have been linked to the #2967 from the beginning.
Thanks, let me know if anything comes up in your local testing! (hopefully not) :-)
@ZanSara, @tstadel , Thanks for both for the code review. I think it looks really good now. I hope it can be merged soon! Thanks!