simple-evals
Run benchmarks for old GPT-4 models (GPT-4-0314 and GPT-4-0613) and all GPT-3.5-turbo models
Zero-shot scores for those models are hard to find online, so publishing them would be very useful for tracking the improvement trend over time!