GPT-2 segfaults when used through the CLI
Trying any GPT-2 GGML model through the CLI appears to cause an immediate segfault:
llama-rs # cargo run --bin llm gpt2 infer -m models/gpt2/cerebras-2.7b-q4_0.bin -p "Now, this is a story all about how"
[...]
[2023-05-01T23:43:17Z INFO llm::cli_args] Model fully loaded! Elapsed: 75ms
zsh: segmentation fault cargo run --bin llm gpt2 infer -m models/gpt2/cerebras-2.7b-q4_0.bin -p
This appears to be true regardless of the model (both Cerebras and base GPT-2 seem to suffer from this).
This doesn't happen when run through the GPT-2 example.
I wonder if this has to do w/ loading through the snapshot.
I am not able to reproduce this problem
llama-rs: ./target/release/llm gpt2 infer -m ~/.ggml-models/cerebras-gpt-13b.bin -p "Hello my name is"
[2023-05-03T17:55:57Z INFO llm::cli_args] ggml ctx size = 7857.04 MB
[2023-05-03T17:55:57Z INFO llm::cli_args] Loaded tensor 8/485
...
[2023-05-03T17:56:02Z INFO llm::cli_args] Loaded tensor 480/485
[2023-05-03T17:56:02Z INFO llm::cli_args] Loading of model complete
[2023-05-03T17:56:02Z INFO llm::cli_args] Model size = 0.00 MB / num tensors = 485
[2023-05-03T17:56:02Z INFO llm::cli_args] Model fully loaded! Elapsed: 5008ms
"Hello my name is 'Celest,' and you're looking for a guy named..." "Marius." ""I'm looking for Marius^C
How weird... is that q4 or f16?
q4? I'm not sure honestly 😅 I think I'm testing w/ this model that appears to have been taken down 🤷🏻 https://huggingface.co/mongolian-basket-weaving/cerebras-gpt-13b-ggml-q4_0
Is this wrong?
https://github.com/rustformers/llm/blob/be56c36/crates/models/gpt2/src/lib.rs#L314-L316
Ok, just tested with https://huggingface.co/xzuyn/GPT-2-124M-ggml-q4_1/blob/main/ggml-model-q4_1.bin on macOS:
# cargo run --bin llm gpt2 infer -m models/gpt2/GPT-2-124M-ggml-q4_1.bin -p "1 + 2 = "
Finished dev [unoptimized + debuginfo] target(s) in 0.08s
Running `target/debug/llm gpt2 infer -m models/gpt2/GPT-2-124M-ggml-q4_1.bin -p '1 + 2 = '`
✓ Loaded 149 tensors (125.8 MB) after 153ms
zsh: segmentation fault cargo run --bin llm gpt2 infer -m models/gpt2/GPT-2-124M-ggml-q4_1.bin -p
Aha - I think you've figured it out...
Running with `--num-ctx-tokens 1024` doesn't segfault for me. Our default of 2048 doesn't work for all models. Oops.
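For context, here's a rough sketch of why context length matters for memory: the KV cache grows linearly with `n_ctx`, so a fixed default of 2048 can request more memory than a model's buffers were sized for. The function name and the exact formula below are illustrative assumptions, not the actual calculation in `llm`:

```rust
// Hypothetical KV-cache sizing sketch: keys + values, one f32 per
// (layer, position, embedding dim). Illustrative only; not the llm formula.
fn kv_cache_bytes(n_ctx: usize, n_layer: usize, n_embd: usize) -> usize {
    2 * n_layer * n_ctx * n_embd * std::mem::size_of::<f32>()
}

fn main() {
    // GPT-2 124M: 12 layers, 768-dim embeddings
    println!(
        "n_ctx=1024: {} MiB",
        kv_cache_bytes(1024, 12, 768) / (1024 * 1024)
    );
    println!(
        "n_ctx=2048: {} MiB",
        kv_cache_bytes(2048, 12, 768) / (1024 * 1024)
    );
}
```

Doubling `n_ctx` doubles the cache, which is why a universal default can blow past what a smaller model's context buffer was allocated for.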
Or maybe not.
# cargo run --release --bin llm gpt2 infer -m models/gpt2/cerebras-2.7b-q4_1.bin -p "Fred looked at his hand and wondered: " --num-ctx-tokens 512
Finished release [optimized] target(s) in 0.08s
Running `target/release/llm gpt2 infer -m models/gpt2/cerebras-2.7b-q4_1.bin -p 'Fred looked at his hand and wondered: ' --num-ctx-tokens 512`
✓ Loaded 389 tensors (5.6 GB) after 91ms
zsh: segmentation fault cargo run --release --bin llm gpt2 infer -m models/gpt2/cerebras-2.7b-q4_1.bi
Quick findings with a debugger:
- Only seems to happen with the mmap'd model
- The segfault occurs here (`data` is invalid): https://github.com/ggerganov/ggml/blob/ff6e03cbcd9bf6e9fa41d49f2495c042efae4dc6/src/ggml.c#L9146
- The only place `get_rows` is used is here: https://github.com/rustformers/llm/blob/7c2edb13149ff78765134e97190fb1f80a2fa39d/crates/models/gpt2/src/lib.rs#L152-L153
- Thus, one of these two tensors is likely not loading correctly through mmap
- Using a sane context length and `--no-mmap` seems to circumvent this for now
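To illustrate the kind of bug this points at: if a tensor's recorded file offset plus its size falls outside the mmap'd region, its `data` pointer is garbage and the first read (e.g. in `get_rows`) segfaults. A minimal bounds-check sketch, with made-up struct and function names (not the `llm` loader API):

```rust
// Hypothetical sketch: detect tensors whose data would lie outside the
// mmap'd file region. Names and layout are illustrative, not llm's API.
struct TensorEntry {
    name: &'static str,
    offset: usize, // byte offset of tensor data within the mapping
    size: usize,   // byte length of tensor data
}

/// Returns the names of tensors that would point past the end of the mapping.
fn validate(mapped_len: usize, tensors: &[TensorEntry]) -> Vec<&'static str> {
    tensors
        .iter()
        .filter(|t| {
            t.offset
                .checked_add(t.size)
                .map_or(true, |end| end > mapped_len)
        })
        .map(|t| t.name)
        .collect()
}

fn main() {
    let tensors = [
        TensorEntry { name: "model/wte", offset: 0, size: 1024 },
        TensorEntry { name: "model/wpe", offset: 2048, size: 4096 }, // ends past the map
    ];
    let bad = validate(4096, &tensors);
    println!("{:?}", bad); // tensors whose data would point outside the mapping
}
```

A check like this after loading would turn the segfault into a diagnosable error, which might help narrow down which of the two tensors is mis-mapped.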
This is definitely something we should investigate and fix, but not a showstopper for now, I think.