
feat: add batch evaluation method for pipelines


Related Issue(s): closes #2636

Proposed changes:

  • Add a pipeline.eval_batch method (a usage sketch follows this list)
  • Add a _build_eval_dataframe_from_batches method that calls _build_eval_dataframe internally
    • I went for this solution to keep code duplication to a minimum; _build_eval_dataframe is already very complex (it comprises more than 300 lines of code and would need refactoring before it could be simplified)
  • Group some code in _add_sas_to_eval_result to avoid code duplication
  • Copy most eval tests into test/pipelines/test_eval_batch.py and make them use pipeline.eval_batch
  • Add a use_batch_mode option to execute_eval_run, with the default set to False until pipeline.eval_batch is consistently faster than pipeline.eval
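
A minimal usage sketch of the proposed API is below. It is illustrative only: the retriever, the reader, and the eval_labels list are assumed to be set up beforehand, and the parameter values are examples, not recommendations.

```python
from haystack.pipelines import ExtractiveQAPipeline

# Assumed setup (not shown here): a retriever, a reader, and eval_labels,
# a list of MultiLabel objects built from the evaluation dataset.
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# Existing per-query evaluation: one run call per query under the hood.
eval_result = pipeline.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
)

# Proposed batch evaluation: same interface, but predictions are
# generated with run_batch instead of one run call per query.
eval_result_batch = pipeline.eval_batch(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
)

# Both return an EvaluationResult, so metric calculation is unchanged.
metrics = eval_result_batch.calculate_metrics()
```

In execute_eval_run, the new use_batch_mode flag selects between these two code paths; as noted above, it defaults to False for now.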

Limitations:

  • I ran speed tests, and pipeline.eval_batch is slower than pipeline.eval due to overhead in the final data aggregation. Generating the predictions themselves (run_batch vs. run with debug=True) is faster in batch mode, as expected, and the generated results are correct. I intend to solve the speed issue in a separate follow-up PR by re-implementing _build_eval_dataframe_from_batches
  • I faced multiprocessing issues, as discussed offline; the current workaround is setting num_processes or max_processes to 1.
  • ~~Up for discussion: Should standard_pipelines.eval and standard_pipelines.eval_batch have a documents parameter that they pass on?~~ We decided no; it's not needed at the moment.
  • run_batch does not support different filters (or, more generally, any params that differ per query), so eval_batch cannot support filters that differ per query and its label. Labels therefore must not have filters; for example, the test case test_extractive_qa_labels_with_filters won't work with eval_batch (see the sketch after this list)
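
To illustrate the filters limitation, here is a hedged sketch reusing the pipeline and eval_labels from the sketch above; attribute names follow the Haystack v1 schema, where a MultiLabel carries the filters shared by its underlying labels.

```python
# Per-query evaluation can pass each label's own filters to the retriever:
for label in eval_labels:
    prediction = pipeline.run(
        query=label.query,
        params={"Retriever": {"filters": label.filters, "top_k": 5}},
    )

# run_batch accepts a single params dict that applies to every query, so
# filters that differ from label to label have no place to go:
predictions = pipeline.run_batch(
    queries=[label.query for label in eval_labels],
    params={"Retriever": {"top_k": 5}},  # at most one shared filters value here
)
```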

Currently the following tests are commented out because they are expected to fail due to other issues:

  • test_extractive_qa_eval_translation because of https://github.com/deepset-ai/haystack/issues/2964
  • test_qa_multi_retriever_pipeline_eval because of https://github.com/deepset-ai/haystack/issues/2962
  • test_multi_retriever_pipeline_eval because of https://github.com/deepset-ai/haystack/issues/2962
  • test_multi_retriever_pipeline_with_asymmetric_qa_eval because of https://github.com/deepset-ai/haystack/issues/2962


julian-risch · Aug 01 '22 16:08

@agnieszka-m Thank you for the detailed feedback. All the change requests are addressed now.

julian-risch · Aug 09 '22 13:08

> Up for discussion: Should standard_pipelines.eval and standard_pipelines.eval_batch have a documents parameter that they pass on?

I think it's fine not to have them here, as so far they are only needed by non-standard pipelines.

tstadel · Aug 09 '22 15:08