haystack Refactoring the `Raypipeline.run` method - merging it with the `Pipeline.run`

This is to fix #2967

Related Issues

fixes #2967

Proposed Changes:

There is a bug (see #2967), which makes the RayPipeline fail for more complicated pipelines, like the BM25Retriever + EmbeddingRetriever + Reader pipeline. To fix this - after a discussion with @ZanSara (see https://app.slack.com/client/T01J6PLALBV/C01J3TZM9HT), we agreed that we should refactor the RayPipeline.run by basically removing it and making the Pipeline.run to be able to work with calls from the RayPipeline. This is what this PR entails.

How did you test it?

I have added a new unit test for the scenario when there are more than one Retrievers in front of the Reader in the RayPipeline. I have also tested whether this fixes #2967, and it does.

Notes for the reviewer

N/A

Checklist

[x] I have read the contributors guidelines and the code of conduct
[x] I have updated the related issue with new insights and changes
[x] I added tests that demonstrate the correct behavior of the change
[x] I've used the conventional commit convention for my PR title
[x] I documented my code
[x] I ran pre-commit hooks and fixed any issue

Aug 06 '22 18:08 zoltan-fedor

Hello! Thank you for this PR! By reading the description I have a couple of things to ask before going into the review.

First:

I could not find the place in the pipelines/base.py where the resulting documents of multiple retriever nodess are being joined together before being used as the input for the Reader node (eg documents=List[]), so I had to add that (see https://github.com/deepset-ai/haystack/compare/master...zoltan-fedor:haystack:feature-raypipeline-refactor-run?expand=1#diff-31b925f6ad8af5214cd1aeae3ff4f7afecf5e8a5e44a95cf458fd463fcae576fR520), including removing duplicate documents by content. If there is a method which already does this, then this part of the code might be a duplication, but I couldn't find where this might be...

So the situation with joining outputs is needlessly complicated. If you want to read a thread on the topic, I had a quite in-depth exchange with @MichelBartels a while ago exactly on this topic: https://github.com/deepset-ai/haystack/issues/2599 Have a look at the two related PRs for the actual discussion.

To summarize: on one end we have pipelines that perform document joining with no problem whatsoever:

Example 1

# This works

from pathlib import Path
from haystack import Pipeline
from haystack.nodes import FileTypeClassifier, TextConverter
from haystack.document_stores import InMemoryDocumentStore

pipe = Pipeline()

pipe.add_node(name="classifier", component=FileTypeClassifier(supported_types=["txt", "sh", "png", "yml"]), inputs=["File"])
pipe.add_node(name="txt-converter", component=TextConverter(), inputs=["classifier.output_1"])
pipe.add_node(name="sh-converter", component=TextConverter(), inputs=["classifier.output_2"])
pipe.add_node(name="yml-converter", component=TextConverter(), inputs=["classifier.output_4"])
pipe.add_node(name="docstore", component=InMemoryDocumentStore(), inputs=["txt-converter", "sh-converter", "yml-converter"])

pipe.run(file_paths=[Path("data/tutorial17/1_Dragonstone__Game_of_Thrones_episode_.txt")])

On the other, we have pipelines that can't work without JoinDocuments:

Example 2

# This fails

from haystack import Pipeline
from haystack.nodes import BM25Retriever, FARMReader, JoinDocuments
from haystack.document_stores import ElasticsearchDocumentStore

docstore = ElasticsearchDocumentStore()
pipe = Pipeline()
pipe.add_node(name="bm25_1", component=BM25Retriever(document_store=docstore), inputs=["Query"])
pipe.add_node(name="bm25_2", component=BM25Retriever(document_store=docstore), inputs=["Query"])
# pipe.add_node(name="join", component=JoinDocuments(), inputs=["bm25_2", "bm25_1"])  # UNCOMMENT TO FIX
pipe.add_node(name="reader", component=FARMReader(model_name_or_path="deepset/roberta-base-squad2"), inputs=["join"])

results = pipe.run(query="Who's the father of Arya Stark?")

for answer in results["answers"]:
    print(answer.answer)

# File "/home/sara/work/haystack/haystack/pipelines/base.py", line 509, in run
#     node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(**node_input)
#   File "/home/sara/work/haystack/haystack/nodes/base.py", line 201, in _dispatch_run
#     return self._dispatch_run_general(self.run, **kwargs)
#   File "/home/sara/work/haystack/haystack/nodes/base.py", line 245, in _dispatch_run_general
#     output, stream = run_method(**run_inputs, **run_params)
# TypeError: run() missing 1 required positional argument: 'documents'

However, given that there is a way to make all pipeline works, by either using JoinDocuments or not, I propose to not address this issue yet, and leave them for the soon-to-come pipeline refactoring we have planned for this year: https://github.com/deepset-ai/haystack/issues/2807 I think we should focus our efforts on fixing the bug in JoinDocuments and (for now) leave the pipeline as it is.

Once you removed the changes addressing this mismatch let me know and I will review the PR. I'm sorry to ask this, but this whole situation is better addressed with a more radical refactoring, because there is a lot of convoluted stuff going on :sweat_smile:

Second: I'm not sure the PR and the issue linked are exactly the same. This PR is more like a refactoring of RayPipeline, but does not address the async support, right? We can say that this PR is ground work for that issue, but I would like to keep the original issue open until async calls are actually supported.

Aug 08 '22 08:08 ZanSara

Hi @ZanSara,

In the end I got to the same conclusion based on the current state of the Pipeline code - namely that some pipelines do require the JoinDocuments node to join documents, while others don't - and that is just how it is for now. Based on this and your comment I have now removed that unnecessary complication about trying to make it work for all pipelines without the use of JoinDocuments. This way the PR is a lot simpler now.

Regarding your second comment - sorry for the confusion, but yes, this PR was not meant to address the not having async issue. I have linked the wrong issue to the PR. Apologies. I have fixed that now - it should have been linked to the #2967 from the beginning.

Aug 08 '22 12:08 zoltan-fedor

Thanks, let me know if anything comes up in your local testing! (hopefully not) :-)

Aug 08 '22 13:08 zoltan-fedor

@ZanSara, @tstadel , Thanks for both for the code review. I think it looks really good now. I hope it can be merged soon! Thanks!

Aug 10 '22 15:08 zoltan-fedor

haystack haystack copied to clipboard

Refactoring the `Raypipeline.run` method - merging it with the `Pipeline.run`

Related Issues

Proposed Changes:

How did you test it?

Notes for the reviewer

Checklist

haystack
haystack copied to clipboard