Generation error on MPS (Torch >= 2.2.0, MacOS 14.4)
When running train_gpt2.py, I get all 16 output tokens equal to "!" (token 0).
Here is the complete output:
❯ python3 train_gpt2.py
using device: mps
loading weights from pretrained gpt: gpt2
loading cached tokens in data/tiny_shakespeare_val.bin
wrote gpt2_124M.bin
wrote gpt2_124M_debug_state.bin
iteration 0, loss: 5.2700090408325195
iteration 1, loss: 4.059708118438721
iteration 2, loss: 3.375123977661133
iteration 3, loss: 2.800778388977051
iteration 4, loss: 2.315387725830078
iteration 5, loss: 1.8490203619003296
iteration 6, loss: 1.3946478366851807
iteration 7, loss: 0.999144434928894
iteration 8, loss: 0.624073326587677
iteration 9, loss: 0.37650370597839355
<|endoftext|>!!!!!!!!!!!!!!!!
---------------
I am currently using Python 3.12 and PyTorch 2.2.2 on MPS on an M1 Pro MacBook running MacOS 14.4.1. I have tried it with Python 3.9 and 3.11 as well (all on Torch 2.2.2), and all have the same issue. When downgrading to torch 2.1.X or using the CPU, this does not happen. I had the same issue with the original NanoGPT implementation (see karpathy/nanoGPT#458).
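For reference, this is roughly the check I ran in each environment to confirm which Torch version and backend were in play (a minimal sketch, not the script's actual device-selection code):
import torch
# Print the version and MPS availability being compared across environments.
print(torch.__version__)                  # 2.2.x reproduces the issue, <= 2.1.x does not
print(torch.backends.mps.is_available())  # True on the affected Apple Silicon machines
# Forcing the CPU backend instead of "mps" makes generation sensible again.
device = "cpu"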
I also tried to test this on a friend's M1 Pro MacBook using the following script, and the output is the same.
#!/bin/bash
cd "$HOME" || exit 0
OUT_FNAME="$HOME/results.txt"
python3 --version | tee "$OUT_FNAME"
git clone https://github.com/karpathy/llm.c.git
cd llm.c || exit 1
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
pip3 list | tee -a "$OUT_FNAME"
python3 prepro_tinyshakespeare.py
python3 train_gpt2.py | tee -a "$OUT_FNAME"
rm -rf "$HOME/llm.c"
if [ -z "$XDG_CACHE_DIR" ]; then
rm -rf "$HOME/.cache/huggingface/hub/models--gpt2"
else
rm -rf "$XDG_CACHE_DIR/huggingface/hub/models--gpt2"
fi
echo "Done! Output file stored in $OUT_FNAME"
After some debugging, I narrowed it down to a bug in the "advanced indexing" used when setting the non-top-k logits to -inf in train_gpt2.py:203:
logits[logits < v[:, [-1]]] = -float('Inf')
Indeed, after that line, the logits tensor only contains -inf, which makes the softmax values nan, and sampling from those always yields token 0.
I was also able to reproduce the issue using the Python interpreter to modify values in a tensor using conditional indexing.
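A minimal sketch of the kind of interpreter repro I used (hypothetical shapes and values; only the top-k masking pattern matters):
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

logits = torch.randn(1, 50257, device=device)  # stand-in for GPT-2 logits
v, _ = torch.topk(logits, k=40)                # keep the top 40, as in train_gpt2.py
logits[logits < v[:, [-1]]] = -float('Inf')    # boolean-mask assignment ("advanced indexing")

probs = torch.softmax(logits, dim=-1)
# On the affected setups every logit ends up -inf, so the softmax is all nan.
print(torch.isinf(logits).all().item(), torch.isnan(probs).any().item())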
Since I haven't seen anyone else report this problem, I want to make sure I am not the only one hitting it, so let me know if anyone else has encountered it.
I also tried to replace the assignment with the following:
logits = torch.where(logits >= v[:, [-1]], logits, -float("Inf"))
and again, it works on CPU but not on MPS (even though I suspect it's actually using the same code underneath).
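An alternative sketch that sidesteps the comparison-based mask entirely, by rebuilding the logits from the top-k indices with scatter_ (just to illustrate the idea, not a tested patch for train_gpt2.py, and whether this path hits the same MPS bug would still need checking):
v, ix = torch.topk(logits, k=40)                # values and indices of the top 40
masked = torch.full_like(logits, -float('Inf')) # start from all -inf
masked.scatter_(-1, ix, v)                      # copy only the top-k logits back in
logits = masked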
Indeed, I encountered the same issue on PyTorch versions 2.2.0 and 2.2.2.
Can I ask which CPU you have? Thanks!
It's an M2 Pro with 16GB RAM.
This is so random! I just tested the exact same scenario on my Mac, but on a fresh macOS 14.4 installation I have on an external drive for testing, and the problem does not appear. Still using Python 3.12, 3.10 and 3.9 (the first two installed via Homebrew, the latter the system install that comes with the command line tools) and Torch 2.2.2.
I am confused.
I got a bus error on Python 3.10 and Torch 2.0.1; after upgrading to Torch 2.2.2 it ran with no issue.
This is NOT a bug, it is expected. I can reproduce the result with the following setup on a MacBook Air M2:
-- sw_vers
ProductName: macOS
ProductVersion: 14.4.1
BuildVersion: 23E224
-- python -V
Python 3.12.2
-- pytorch 2.2.2 installed with
conda create --name=pytorch python=3.12
conda activate pytorch
conda install pytorch::pytorch torchvision torchaudio -c pytorch
-- the result is like this:
python train_gpt2.py
using device: mps
wrote gpt2_tokenizer.bin
loading weights from pretrained gpt: gpt2
loading cached tokens in data/tiny_shakespeare_val.bin
wrote gpt2_124M.bin
wrote gpt2_124M_debug_state.bin
iteration 0, loss: 5.2700090408325195, time: 5398.335ms
iteration 1, loss: 4.059708595275879, time: 271.103ms
...
iteration 9, loss: 0.3765036463737488, time: 270.090ms
final 20 iters avg: 785.775ms
<|endoftext|>! !! !! !! !! !!
and I can go ahead successfully with OMP_NUM_THREADS=8 ./train_gpt2:
...
step 39: train loss 3.970751 (took 3422.735000 ms)
val loss 4.107781
generating:
Come Running Away, Greater conquer With the Imperial blood the heaviest host of the gods into this wondrous world beyond. I will not back thee, for how sweet after birth Netflix against repounder, will not flourish against the earlocks of Allay
step 40: train loss 4.377757 (took 3428.180000 ms)
This is NOT a bug, it is expected.
I'm sorry, I don't see how this should be the expected behavior, especially considering how switching to Torch <= 2.1.2 yields intelligible text...
Also, I don't see how running train_gpt2, i.e. the compiled C program, is relevant, as it does not use PyTorch.