
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this

Open rexzhang2023 opened this issue 2 years ago • 21 comments

```
llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113739.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
...................................................................................................
```

I am using a recommended model, but I get this error message. How do you think I could solve it?

rexzhang2023 · May 09 '23 07:05

Is this the full output? If not, please post the full output.

toninog · May 09 '23 11:05

```
(privategpt) root@alienware17B:/home/rex/privateGPT# python privateGPT.py
llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113739.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size  = 512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from './models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
Enter a query:
```

rexzhang2023 · May 09 '23 11:05

I got the same messages.

dennydream · May 09 '23 22:05

```
(venv) my@laptop:~/privateGPT$ python privateGPT.py
llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size  = 512.00 MB
AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
Illegal instruction (core dumped)
```

TechVentureBuilder · May 11 '23 06:05
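A note on the crash above: the capability banner reports `AVX2 = 0` and `FMA = 0` right before `Illegal instruction`, which usually means the binary was built with CPU instructions this processor does not support. A minimal sketch (Linux only, plain Python) to check which of the relevant instruction sets the CPU advertises:

```python
# Minimal sketch, Linux only: read /proc/cpuinfo and report whether the CPU
# advertises the instruction sets shown in llama.cpp's capability banner.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for isa in ("avx", "avx2", "fma", "f16c"):
    print(f"{isa}: {'yes' if isa in flags else 'no'}")
```

If `avx2` is missing, rebuilding the bindings from source on the affected machine (so the compiler targets that CPU) is the usual way out.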

I got the same errors.

timonweb · May 11 '23 16:05

```
(privategpt) root@alienware17B:/home/rex/privateGPT# python privateGPT.py
llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113739.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size  = 512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from './models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
Enter a query:
```

That seems to be working; look at "Enter a query:". I always get the mmap message because I use old llama 7B and 13B models, and I just ignore it.

alxspiker · May 11 '23 16:05
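For context on the mmap warning itself: llama.cpp decides how it can load a file from the magic bytes at its start, and only the newer aligned 'ggjt' layout supports mmap. A minimal sketch to inspect a model file (the magic constants below are the ones llama.cpp's loader used at the time; the path is an example):

```python
import struct

# Magic values from llama.cpp's loader for the successive file formats.
FORMATS = {
    0x67676D6C: "'ggml' unversioned (no mmap, old tokenizer scores)",
    0x67676D66: "'ggmf' v1 (no mmap)",
    0x67676A74: "'ggjt' (aligned tensors, mmap-capable)",
}

with open("./models/ggml-model-q4_0.bin", "rb") as f:  # example path
    (magic,) = struct.unpack("<I", f.read(4))
print(FORMATS.get(magic, f"unknown magic {magic:#010x}"))
```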

@alxspiker I'm getting the error at the ingestion step:

```
llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
```

timonweb · May 11 '23 17:05
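One note on the ingestion trace above: `std::bad_alloc` is a failed C++ allocation, so the process almost certainly ran out of RAM while reserving the ~5.8 GB the loader reports. A quick headroom check before loading (assumes the `psutil` package is installed):

```python
import psutil  # assumption: installed via `pip install psutil`

# Report available RAM; the 7B Q4_0 model above reports needing ~5.8 GB
# plus ~2 GB per state, so anything close to that leaves no headroom.
available_gb = psutil.virtual_memory().available / 2**30
print(f"available RAM: {available_gb:.1f} GiB")
```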

I get this error too.

d2rgaming-9000 · May 14 '23 23:05

I get the same error, and the embedding process takes a while for the state of the union doc. My machine heats up and the fan blows at full speed. I have an M2 Max with 38 cores and 64 GB of memory. Inference also takes a long time, and it crashes after 2-3 questions.

manojkr19 · May 15 '23 01:05

I get the same error. Any help sorting out this error is greatly appreciated.

Puttappaiahm · May 15 '23 06:05

```
llama.cpp: loading model from models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
```

I get the same error. Any help sorting out this error is greatly appreciated.

Bioleme · May 15 '23 07:05

Same error...

wouterstultiens · May 15 '23 08:05

same here

citron · May 15 '23 15:05

This is the error I'm getting:

```
main: build = 553 (63d2046)
main: seed = 1684182263
llama.cpp: loading model from ggml-alpaca-7b-native-q4.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = ggmf v1 (old version with no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
error loading model: this format is no longer supported (see https://github.com/ggerganov/llama.cpp/pull/1305)
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'ggml-alpaca-7b-native-q4.bin'
main: error: unable to load model
```

jasonogrady · May 15 '23 20:05

```
main: build = 553 (63d2046)
main: seed = 1684197811
llama.cpp: loading model from ./models/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
error loading model: this format is no longer supported (see https://github.com/ggerganov/llama.cpp/pull/1305)
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/ggml-model-q4_1.bin'
main: error: unable to load model
```

ByteEvangelist · May 16 '23 00:05
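Both `this format is no longer supported` failures above come from a newer llama.cpp build refusing files quantized before the change in https://github.com/ggerganov/llama.cpp/pull/1305; the usual remedy was to regenerate the quantized file from an f16 copy with llama.cpp's `quantize` tool. A sketch, assuming you have a built llama.cpp checkout and still have the f16 model (all paths are examples):

```python
import subprocess

# Re-quantize an f16 model so the output is written in the current format.
subprocess.run(
    [
        "./llama.cpp/quantize",          # assumed path to the built tool
        "./models/ggml-model-f16.bin",   # assumed f16 source model
        "./models/ggml-model-q4_0.bin",  # output file in the new format
        "q4_0",                          # target quantization type
    ],
    check=True,
)
```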

Same issue:

```
Loading documents from source_documents
Loaded 28 documents from source_documents
Split into 2405 chunks of text (max. 500 tokens each)
llama.cpp: loading model from models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1000
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
```

CraftCanna · May 16 '23 04:05

What's the cause?

zzzgit · May 16 '23 06:05

```
ecs-user@mimiako:~/privateGPT$ python3 ./privateGPT.py
llama.cpp: loading model from /home/ecs-user/privateGPT/downloadedFiles/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1000
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size  = 1000.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
Using embedded DuckDB with persistence: data will be stored in: /home/ecs-user/privateGPT/vector/
gptj_model_load: loading model from '/home/ecs-user/privateGPT/downloadedFiles/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ...........................Killed
```

zzzgit · May 16 '23 06:05

Some of the errors here relate to memory: the host system does not have enough. When you see "Killed", the kernel has terminated the process (python) because the system ran out of memory.

Run `dmesg` or check `/var/log/messages` for more information.

toninog · May 16 '23 06:05
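Following up on that suggestion, a small sketch that scans the kernel log for OOM-killer entries (assumes Linux and permission to run `dmesg`):

```python
import subprocess

# Scan the kernel ring buffer for out-of-memory kills (may require root).
dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    if "Out of memory" in line or "oom-kill" in line.lower():
        print(line)
```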

same error

dennis-gonzales · May 16 '23 07:05

I redownloaded the model and embeddings file, and it goes through after that.

kkski · May 16 '23 12:05

@kkski What do you mean by "goes through" after you redownloaded the models?

JA-Bonilla · May 16 '23 18:05

We are not using llama.cpp as the embeddings model anymore. Plus, ingestion got a LOT faster with the new embeddings model #224

Note: this is a breaking change; any existing database will stop working with the new changes. You'll need to re-ingest your docs, which is recommended anyway, since the process is faster and the results are better.

imartinez · May 17 '23 09:05
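For anyone re-ingesting after this change, a minimal sketch (assumes the default `db` persist directory and `ingest.py` at the repo root, as in the logs above):

```python
import shutil
import subprocess

# Remove the old, incompatible vector store, then rebuild it from the docs.
shutil.rmtree("db", ignore_errors=True)              # default persist directory
subprocess.run(["python", "ingest.py"], check=True)  # re-ingest source_documents
```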