
feat: add batch evaluation method for pipelines


Related Issue(s): closes #2636

Proposed changes:

  • Add a pipeline.eval_batch method (a usage sketch follows this list)
  • Add a _build_eval_dataframe_from_batches method that calls _build_eval_dataframe internally
    • I went for this solution to keep code duplication to a minimum; _build_eval_dataframe is already very complex (it comprises more than 300 lines of code and would need refactoring before it could be simplified)
  • Group some code in _add_sas_to_eval_result to avoid code duplication
  • Copy most eval tests into test/pipelines/test_eval_batch.py and make them use pipeline.eval_batch
  • Add a use_batch_mode option to execute_eval_run, with the default set to False until pipeline.eval_batch is consistently faster than pipeline.eval
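
A minimal usage sketch of the proposed API is below. It is illustrative only: the retriever, the reader, and the eval_labels list are assumed to be set up beforehand, and the parameter values are examples, not recommendations.

```python
from haystack.pipelines import ExtractiveQAPipeline

# Assumed setup (not shown here): a retriever, a reader, and eval_labels,
# a list of MultiLabel objects built from the evaluation dataset.
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# Existing per-query evaluation: one run call per query under the hood.
eval_result = pipeline.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
)

# Proposed batch evaluation: same interface, but predictions are
# generated with run_batch instead of one run call per query.
eval_result_batch = pipeline.eval_batch(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
)

# Both return an EvaluationResult, so metric calculation is unchanged.
metrics = eval_result_batch.calculate_metrics()
```

In execute_eval_run, the new use_batch_mode flag selects between these two code paths; as noted above, it defaults to False for now.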

Limitations:

  • I ran speed tests, and pipeline.eval_batch is slower than pipeline.eval due to overhead in the final data aggregation. Generating the predictions themselves (run_batch vs. run with debug=True) is faster in batch mode, as expected, and the generated results are correct. I intend to solve the speed issue in a separate follow-up PR by re-implementing _build_eval_dataframe_from_batches
  • I faced multiprocessing issues, as discussed offline; the current workaround is setting num_processes or max_processes to 1.
  • ~~Up for discussion: Should standard_pipelines.eval and standard_pipelines.eval_batch have a documents parameter that they pass on?~~ We decided no; it's not needed at the moment.
  • run_batch does not support different filters (or, more generally, any params that differ per query), so eval_batch cannot support filters that differ per query and its label. Labels therefore must not have filters; for example, the test case test_extractive_qa_labels_with_filters won't work with eval_batch (see the sketch after this list)
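
To illustrate the filters limitation, here is a hedged sketch reusing the pipeline and eval_labels from the sketch above; attribute names follow the Haystack v1 schema, where a MultiLabel carries the filters shared by its underlying labels.

```python
# Per-query evaluation can pass each label's own filters to the retriever:
for label in eval_labels:
    prediction = pipeline.run(
        query=label.query,
        params={"Retriever": {"filters": label.filters, "top_k": 5}},
    )

# run_batch accepts a single params dict that applies to every query, so
# filters that differ from label to label have no place to go:
predictions = pipeline.run_batch(
    queries=[label.query for label in eval_labels],
    params={"Retriever": {"top_k": 5}},  # at most one shared filters value here
)
```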

Currently the following tests are commented out because they are expected to fail due to other issues:

  • test_extractive_qa_eval_translation because of https://github.com/deepset-ai/haystack/issues/2964
  • test_qa_multi_retriever_pipeline_eval because of https://github.com/deepset-ai/haystack/issues/2962
  • test_multi_retriever_pipeline_eval because of https://github.com/deepset-ai/haystack/issues/2962
  • test_multi_retriever_pipeline_with_asymmetric_qa_eval because of https://github.com/deepset-ai/haystack/issues/2962


julian-risch · Aug 01 '22 16:08

@agnieszka-m Thank you for the detailed feedback. All the change requests are addressed now.

julian-risch · Aug 09 '22 13:08

> Up for discussion: Should standard_pipelines.eval and standard_pipelines.eval_batch have a documents parameter that they pass on?

I think it's fine not to have them here, as so far they are only needed by non-standard pipelines.

tstadel · Aug 09 '22 15:08