LocalAIME

This simple tool tests local (or not) LLMs on the AIME problems. Even if some models are specifically trained to solve AIME-style problems or even trained specifically on some of them (by accident or purpose), it is still useful for comparing models of the same family or different quantizations of the same exact model. It would also be interesting to test same model, same quantization, but from different sources on huggingfcace.

Example results

Setup

First of all prepare the project for the first test:

git clone https://github.com/Belluxx/LocalAIME.git
cd LocalAIME
python3 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install -r requirements.txt

Run benchmark

Now you are ready to test a model on AIME 2024. Be sure to match both the --base-url and --model identifier based on which platform and which exact model you are using.

Ollama

python3 src/main.py \
    --base-url 'http://127.0.0.1:11434/v1' \
    --model 'gemma3:4b' \
    --max-tokens 32000 \
    --timeout 2000 \
    --problem-tries 3

LMStudio

python3 src/main.py \
    --base-url 'http://127.0.0.1:1234/v1' \
    --model 'gemma-3-4b-it-qat' \
    --max-tokens 32000 \
    --timeout 2000 \
    --problem-tries 3

Llama.cpp

Start the llama-server (be sure to use optimal temp, top-k, top-p, min-p from the model provider):

llama-server \
    -m /Absolute/path/to/my_model.gguf \
    --mlock \
    --n-gpu-layers -1 \
    --ctx-size 31000 \
    --port 8080 \
    --temp 0.7 \
    --top-k 20 \
    --top-p 0.8 \
    --min-p 0.0

Then run the benchmark:

python3 src/main.py \
    --base-url 'http://127.0.0.1:8080/v1' \
    --model 'my-model' \
    --max-tokens 30000 \
    --timeout 2000 \
    --problem-tries 3

See results

After the test is finished, you can open the generated model-name.json file and check the results.

If you test many models you can also put all of them in a directory (eg. results/) and plot the results to get an overview:

python3 src/plot.py results

Then check the plots inside plots/

Credits

AIME 2024 problems dataset retrieved from HuggingFaceH4

LocalAIME
LocalAIME copied to clipboard

Metadata

LocalAIME

Example results

Setup

Run benchmark

Ollama

LMStudio

Llama.cpp

See results

Credits

← Metadata

Owner

Metadata

LocalAIME LocalAIME copied to clipboard

Metadata

LocalAIME

Example results

Setup

Run benchmark

Ollama

LMStudio

Llama.cpp

See results

Credits

← Metadata

Owner

Metadata

LocalAIME
LocalAIME copied to clipboard