
Can't run inference anymore

Open p-arndt opened this issue 8 months ago • 9 comments

A few days ago I could run the inference as normal. Today it just doesn't work anymore.

I downloaded the model and then set it up according to the docs.

LOGS:

warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3955 (a8ac7072) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 24 key-value pairs and 332 tensors from models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bitnet-b1.58
llama_model_loader: - kv   1:                               general.name str              = bitnet2b
llama_model_loader: - kv   2:                    bitnet-b1.58.vocab_size u32              = 128256
llama_model_loader: - kv   3:                bitnet-b1.58.context_length u32              = 4096
llama_model_loader: - kv   4:              bitnet-b1.58.embedding_length u32              = 2560
llama_model_loader: - kv   5:                   bitnet-b1.58.block_count u32              = 30
llama_model_loader: - kv   6:           bitnet-b1.58.feed_forward_length u32              = 6912
llama_model_loader: - kv   7:          bitnet-b1.58.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:          bitnet-b1.58.attention.head_count u32              = 20
llama_model_loader: - kv   9:       bitnet-b1.58.attention.head_count_kv u32              = 5
llama_model_loader: - kv  10:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  11: bitnet-b1.58.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                bitnet-b1.58.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  13:                          general.file_type u32              = 40
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type i2_s:  210 tensors
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'bitnet-b1.58'
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model 'models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf'
main: error: unable to load model
Error occurred while running command: Command '['build/bin/llama-cli', '-m', 'models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf', '-n', '128', '-t', '2', '-p', 'Hi', '-ngl', '0', '-c', '2048', '--temp', '0.8', '-b', '1', '-cnv']' returned non-zero exit status 1.

p-arndt avatar Apr 24 '25 15:04 p-arndt

I'm also experiencing this with the Electron-BitNet app, using the latest GGUF uploaded by Microsoft to Hugging Face.

I updated my local BitNet repo to the latest version, recompiled it, and still got the same result.

Running the following commands:

huggingface-cli download microsoft/BitNet-b1.58-2B-4T --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s

Results in the following terminal error:

python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
INFO:root:Compiling the code using CMake.
INFO:root:Loading model from directory models/BitNet-b1.58-2B-4T.
INFO:root:Converting HF model to GGUF format...
ERROR:root:Error occurred while running command: Command '['C:\\Users\\usr\\anaconda3\\envs\\bitnet-cpp\\python.exe', 'utils/convert-hf-to-gguf-bitnet.py', 'models/BitNet-b1.58-2B-4T', '--outtype', 'f32']' returned non-zero exit status 1., check details in logs\convert_to_f32_gguf.log

Then the following log:

INFO:hf-to-gguf:Loading model: BitNet-b1.58-2B-4T
Traceback (most recent call last):
  File "C:\Users\usr\Desktop\git\BitNet\utils\convert-hf-to-gguf-bitnet.py", line 1165, in <module>
    main()
  File "C:\Users\usr\Desktop\git\BitNet\utils\convert-hf-to-gguf-bitnet.py", line 1143, in main
    model_class = Model.from_model_architecture(hparams["architectures"][0])
  File "C:\Users\usr\Desktop\git\BitNet\utils\convert-hf-to-gguf-bitnet.py", line 240, in from_model_architecture
    raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'BitNetForCausalLM' not supported!
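
For context, the converter keys off the "architectures" field in the model's config.json (that is the hparams["architectures"][0] lookup in the traceback). A quick, illustrative check of the mismatch, assuming a standard config.json in the downloaded model directory (the grep is just a sketch; on Windows use findstr or open the file directly):

grep -A 2 '"architectures"' models/BitNet-b1.58-2B-4T/config.json
# Prints 'BitNetForCausalLM', a name the bundled convert-hf-to-gguf-bitnet.py
# script does not recognize.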

Related pull request: https://github.com/microsoft/BitNet/pull/212

Related issues:

https://github.com/microsoft/BitNet/issues/193

https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/discussions/5

grctest avatar Apr 24 '25 19:04 grctest

@grctest Surprisingly, I could compile the model, but I still can't run it.

I also had the same issue as you before, but somehow it's gone now and I'm stuck on inference instead.

It would be a big benefit if the versions of every package and model were pinned, so these issues don't happen.

I saw they're referencing llama.cpp in the 3rd party folder; maybe there have been some changes there that are causing the errors.

p-arndt avatar Apr 24 '25 19:04 p-arndt

Despite the 'Converting HF model to GGUF format...' step failing, the compile step seemingly fixed whatever was preventing inference with the latest Huggingface model. I'll try to get a release out to test it further.

grctest avatar Apr 24 '25 19:04 grctest

OK, my latest version runs inference fine now: https://github.com/grctest/Electron-BitNet/releases/tag/v0.3.2

The Python setup scripts still need to be fixed, however.

grctest avatar Apr 24 '25 20:04 grctest

@grctest what did you change to get inference running again?

p-arndt avatar Apr 24 '25 20:04 p-arndt

Please update to the latest version of the code with:

git pull --recurse-submodules
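
After pulling, rebuild so that llama-cli is compiled against the updated third-party llama.cpp; a sketch, reusing the setup command already shown in this thread:

python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
# Recompiling should make the binary recognize the 'bitnet-b1.58' architecture
# reported in the error above.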

junhuihe-hjh avatar Apr 25 '25 05:04 junhuihe-hjh

There was a GGUF model update on Hugging Face, which may cause this issue if you have not synced the code to the latest version.

sd983527 avatar Apr 25 '25 11:04 sd983527

The latest code version still has these issues in it, @sd983527, specifically in the Python scripts - a breaking change.

@Padi2312 I ran the following commands:

huggingface-cli download microsoft/BitNet-b1.58-2B-4T --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s

This let me download the non-GGUF model, which I then converted to a GGUF.

Then running this command:

python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s

Resulted in this output:

INFO:root:Compiling the code using CMake.
INFO:root:Loading model from directory models/BitNet-b1.58-2B-4T.
INFO:root:Converting HF model to GGUF format...
ERROR:root:Error occurred while running command: Command '['C:\\Users\\usr\\anaconda3\\envs\\bitnet-cpp\\python.exe', 'utils/convert-hf-to-gguf-bitnet.py', 'models/BitNet-b1.58-2B-4T', '--outtype', 'f32']' returned non-zero exit status 1., check details in logs\convert_to_f32_gguf.log

Now, whilst the last step indicates that inference should not work because of the unexpected changes to the naming convention, the 'Compiling the code using CMake' step successfully built the executables, which I then included in my Electron application - inference now works.

So you could try interacting directly with the executables, or wait until the project maintainers fix the broken Python scripts.
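
For example, the error output in the original report already shows the underlying command; a minimal sketch of invoking the compiled binary directly (paths and flags taken from that log, adjust as needed):

build/bin/llama-cli -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -n 128 -t 2 -p "Hi" -ngl 0 -c 2048 --temp 0.8 -b 1 -cnv
# This bypasses the Python wrapper scripts and talks to llama-cli directly.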

grctest avatar Apr 25 '25 11:04 grctest

You should not download the FP version; it should be the GGUF file instead. That way it will not trigger the model conversion.
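
For reference, a sketch of fetching the pre-converted GGUF directly (repo name taken from the Hugging Face discussion linked earlier in this thread), which avoids the conversion step:

huggingface-cli download microsoft/bitnet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
# With the GGUF already present, setup_env.py should only compile the code and
# not trigger the HF-to-GGUF conversion.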

sd983527 avatar Apr 25 '25 12:04 sd983527