Joel Niklaus
## Issue encountered

When running large evals with many dataset configurations, it is very painful to rerun everything if something fails.

## Solution/Feature

It would be great if intermediate...
## Issue encountered

Rerunning evaluations with LLM-as-judge metrics can be expensive and time-consuming.

## Solution/Feature

Adding a `diskcache` cache to the `JudgeLM` class could solve this.
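To make the caching idea concrete, here is a minimal sketch of a disk-backed cache around a judge call. It is not the `JudgeLM` implementation and it does not use the `diskcache` package: it swaps in plain JSON files from the standard library so it runs without extra dependencies. `CachedJudge`, `judge_fn`, `judge_model`, and `evaluate` are hypothetical names chosen for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class CachedJudge:
    """Hypothetical wrapper: stores each judge verdict as a JSON file on disk."""

    def __init__(self, judge_fn: Callable[[str], str], cache_dir: str, judge_model: str):
        self.judge_fn = judge_fn        # assumed: any callable mapping a prompt to a verdict
        self.judge_model = judge_model  # part of the key, so different judges never share entries
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, prompt: str) -> Path:
        # Hash the judge model name together with the prompt to get a stable file name.
        payload = json.dumps([self.judge_model, prompt]).encode("utf-8")
        return self.cache_dir / f"{hashlib.sha256(payload).hexdigest()}.json"

    def evaluate(self, prompt: str) -> str:
        path = self._path_for(prompt)
        if path.exists():
            # Cache hit: reuse the stored verdict instead of calling the judge again.
            return json.loads(path.read_text(encoding="utf-8"))["verdict"]
        # Cache miss: call the (expensive) judge once and persist the result.
        verdict = self.judge_fn(prompt)
        path.write_text(json.dumps({"verdict": verdict}), encoding="utf-8")
        return verdict
```

Because the entries live on disk, a rerun after a crash or a metric change only pays for prompts the judge has not seen yet. Keying on the judge model name as well as the prompt is one possible design choice; a real integration would probably also include the judge template and sampling parameters in the key.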
## Issue encountered

When evaluating large models, inference can incur significant costs and delays, especially on larger datasets. I may also want to re-evaluate my predictions using different metrics.

##...
## Issue encountered

Currently, inference of open models on my Mac is quite slow because vLLM does not support MPS.

## Solution/Feature

llama.cpp does support MPS and would significantly...
HELM currently evaluates on only 5 LegalBench tasks. Ideally, we would like to be able to run the evaluation on all tasks. I quickly analyzed the structure of the tasks and...