feat: add batch evaluation method for pipelines
Related Issue(s): closes #2636
Proposed changes:
- Add a `pipeline.eval_batch` method (a usage sketch follows this list)
- Add a `_build_eval_dataframe_from_batches` method that calls `_build_eval_dataframe` internally. I went for this solution to keep code duplication to a minimum: `_build_eval_dataframe` is very complex already (the method comprises >300 lines of code and would need some refactoring to be simplified).
- Group some code in `_add_sas_to_eval_result` to avoid code duplication
- Copy most eval tests into `test/pipelines/test_eval_batch.py` and make them use `pipeline.eval_batch`
- Add a `use_batch_mode` option to `execute_eval_run`, with the default set to `False` until `pipeline.eval_batch` is consistently faster than `pipeline.eval`
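
For context, a minimal usage sketch of the new method. The `retriever`, `reader`, and `eval_labels` names below are placeholders, and I'm assuming `eval_batch` mirrors the signature of the existing `eval`:

```python
from haystack.pipelines import ExtractiveQAPipeline

# `retriever`, `reader`, and `eval_labels` (a list of MultiLabel objects)
# are assumed to be set up beforehand.
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# eval_batch generates the predictions via run_batch under the hood and
# then aggregates them together with the labels into an EvaluationResult.
eval_result = pipeline.eval_batch(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
)

metrics = eval_result.calculate_metrics()
print(metrics["Retriever"]["recall_single_hit"], metrics["Reader"]["f1"])
```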
Limitations:
- I ran speed tests, and `pipeline.eval_batch` is currently slower than `pipeline.eval` due to overhead in the data aggregation at the end. Generating the predictions themselves (`run_batch` vs. `run` with `debug=True`) is faster in batch mode, as expected, and the generated results are correct. I intend to solve the speed issues in a separate follow-up PR by re-implementing `_build_eval_dataframe_from_batches`.
- I faced multiprocessing issues as discussed offline; the current workaround is setting `num_processes` or `max_processes` to 1 (see the first sketch after this list).
- ~~Up for discussion: Should `standard_pipelines.eval` and `standard_pipelines.eval_batch` have a `documents` parameter that they pass on?~~ We decided no, it's not needed at the moment.
- `run_batch` does not support different filters (or, more generally speaking, any params that differ per query), so `eval_batch` cannot support filters that differ per query and its label. Thus, labels must not have filters; for example, the test case `test_extractive_qa_labels_with_filters` won't work with `eval_batch` (see the second sketch after this list).
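
The multiprocessing workaround mentioned above would look roughly like this. I'm assuming a `FARMReader` here, whose `num_processes` parameter controls its inference worker pool:

```python
from haystack.nodes import FARMReader

# Workaround for the multiprocessing issues in batch mode:
# restrict the reader's inference pool to a single process.
reader = FARMReader(
    model_name_or_path="deepset/roberta-base-squad2",
    num_processes=1,
)
```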
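
To illustrate the filter limitation: `run_batch` takes a single `params` dict that applies to every query, so there is no way to attach a different filter to each query/label pair. The queries and filter values below are placeholders, reusing the `pipeline` from the sketch above:

```python
# run_batch applies ONE params dict to ALL queries; there is no
# per-query params/filters argument:
results = pipeline.run_batch(
    queries=["What is Haystack?", "Who maintains Haystack?"],
    params={"Retriever": {"filters": {"source": ["docs"]}, "top_k": 5}},
)

# Consequently, labels carrying individual filters, e.g.
#   label_1.filters == {"source": ["docs"]}
#   label_2.filters == {"source": ["blog"]}
# cannot be honored by eval_batch.
```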
Currently the following tests are commented out because they are expected to fail due to other issues:
- `test_extractive_qa_eval_translation` because of https://github.com/deepset-ai/haystack/issues/2964
- `test_qa_multi_retriever_pipeline_eval` because of https://github.com/deepset-ai/haystack/issues/2962
- `test_multi_retriever_pipeline_eval` because of https://github.com/deepset-ai/haystack/issues/2962
- `test_multi_retriever_pipeline_with_asymmetric_qa_eval` because of https://github.com/deepset-ai/haystack/issues/2962
Pre-flight checklist
- [x] I have read the contributors guidelines
- [x] ~~I have enabled actions on my fork~~
- [x] If this is a code change, I added tests or updated existing ones
- [x] If this is a code change, I updated the docstrings
@agnieszka-m Thank you for the detailed feedback. All the change requests are addressed now.

> Up for discussion: Should `standard_pipelines.eval` and `standard_pipelines.eval_batch` have a `documents` parameter that they pass on?

I think it's fine not to have it there, as so far it's only needed by non-standard pipelines.