can-ai-code
Guide on how to evaluate models
I'm willing to test a few models and share the results. I've looked at the README, but couldn't wrap my head around how to benchmark a model. Any help would be appreciated!
The docs definitely need a rewrite; my apologies.
The general flow is:
- `prepare.py`
- `interview*.py`
- `eval.py`
In the dark days we had to deal with dozens of prompt formats, but these days `prepare.py` can be run with `--chat <hf-model>` and it will sort out the prompt format for you.
Note that there are two interviews, junior-v2 and senior; I usually only run senior on strong models that score >90% on junior.
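If it helps, here is a rough end-to-end sketch of a junior-v2 run. Apart from `--chat`, which is mentioned above, the specific script name, flags, and file paths below are assumptions, so check each script's `--help` (and the README) before running:

```bash
# Hypothetical walkthrough; flag names and paths are assumptions, verify with --help.

# 1. Render the junior-v2 prompts using the model's own chat template
python3 prepare.py --chat <hf-model-id>

# 2. Run the interview; pick whichever interview*.py matches your backend
#    (e.g. a local GPU, llama.cpp, or a hosted API)
python3 interview_cuda.py --input <prepared-prompts-file> --model <hf-model-id>

# 3. Grade the transcripts
python3 eval.py --input <interview-results-file>
```

Once a model clears ~90% on junior-v2, repeat the same three steps with the senior interview.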