Support for evaluating multiple LLMs in parallel with the same task & dataset configuration
This is often helpful as a preliminary first step to understand which LLM performs best on a user's specific task and data.
It would be really neat if we were able to create a report for a user's dataset, similar to the benchmark we put together here: https://www.refuel.ai/blog-posts/llm-labeling-technical-report
Would this be another function (maybe `test_llms(list_of_llms)`) that we could add to `LabelingAgent`, or would this be a config change where, instead of a single model config, we could provide a list of model configs to benchmark?
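
For illustration, here is a minimal sketch of how the function-based option might look today, assuming the existing `LabelingAgent(config)` constructor and its `run()` method (whose exact signature and return value vary by autolabel version). The `benchmark_llms` helper, the candidate model list, and the config values are hypothetical, not part of the current API:

```python
# Hypothetical sketch: benchmark several LLMs by looping over model configs.
# The config keys mirror autolabel's single-model config format; the exact
# run() signature may differ depending on the installed version.
from copy import deepcopy

from autolabel import LabelingAgent

base_config = {
    "task_name": "ToxicCommentClassification",
    "task_type": "classification",
    "dataset": {"label_column": "label", "delimiter": ","},
    "prompt": {
        "task_guidelines": "Classify the comment as toxic or not toxic.",
        "labels": ["toxic", "not toxic"],
    },
    # "model" is filled in per candidate below
}

# Candidate LLMs to compare on the same task & dataset (example values)
candidate_models = [
    {"provider": "openai", "name": "gpt-3.5-turbo"},
    {"provider": "openai", "name": "gpt-4"},
    {"provider": "anthropic", "name": "claude-2"},
]


def benchmark_llms(base_config, models, dataset_path):
    """Hypothetical helper: run the same labeling task once per model config."""
    results = {}
    for model in models:
        config = deepcopy(base_config)
        config["model"] = model
        agent = LabelingAgent(config)
        # Collect whatever the installed version's run() returns (labeled
        # output and/or metrics) so the models can be compared afterwards.
        results[model["name"]] = agent.run(dataset_path)
    return results


report = benchmark_llms(base_config, candidate_models, "test.csv")
```

The config-based alternative would instead accept `"model": [ ... ]` as a list and have the agent iterate internally, which keeps the user-facing API unchanged but pushes the comparison/report logic into the library.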