When is the server version coming?
I would like to keep this model loaded in memory instead of reloading it each time, and I can't do that from the command line. It is useful. Is there a server version planned?
please refer to https://github.com/microsoft/BitNet/issues/78
To compile the server:
- first, set the server build flag:
cmake -S . -B build -DLLAMA_BUILD_SERVER=ON
- then re-run setup_env:
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
- in the build/bin/ directory you will find llama-server; standard llama-server usage applies:
./build/bin/llama-server -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf --port 18080 -t 3 -np 2 --prio 3
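Once it is up, the server can be queried over HTTP; a minimal sketch, assuming the standard llama.cpp server endpoints (/completion and the OpenAI-compatible /v1/chat/completions) and the port 18080 used above:
# Plain completion endpoint
curl -s http://127.0.0.1:18080/completion -H "Content-Type: application/json" -d '{"prompt": "What is BitNet?", "n_predict": 64}'
# OpenAI-compatible chat endpoint
curl -s http://127.0.0.1:18080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "bitnet", "messages": [{"role": "user", "content": "What is BitNet?"}]}'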
Thanks, I am getting this error:
cpp/bin/activate
(bitnet-cpp) lco@rtx:/mnt/nvme0n1/LLM/git/BitNet$ python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
INFO:root:Compiling the code using CMake.
INFO:root:Loading model from directory models/BitNet-b1.58-2B-4T.
INFO:root:Converting HF model to GGUF format...
ERROR:root:Error occurred while running command: Command '['/mnt/nvme0n1/LLM/git/BitNet/bitnet-cpp/bin/python', 'utils/convert-hf-to-gguf-bitnet.py', 'models/BitNet-b1.58-2B-4T', '--outtype', 'f32']' returned non-zero exit status 1., check details in logs/convert_to_f32_gguf.log
(bitnet-cpp) lco@rtx:/mnt/nvme0n1/LLM/git/BitNet$ cat logs/convert_to_f32_gguf.log
INFO:hf-to-gguf:Loading model: BitNet-b1.58-2B-4T
Traceback (most recent call last):
File "/mnt/nvme0n1/LLM/git/BitNet/utils/convert-hf-to-gguf-bitnet.py", line 1165, in <module>
main()
File "/mnt/nvme0n1/LLM/git/BitNet/utils/convert-hf-to-gguf-bitnet.py", line 1143, in main
model_class = Model.from_model_architecture(hparams["architectures"][0])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/nvme0n1/LLM/git/BitNet/utils/convert-hf-to-gguf-bitnet.py", line 240, in from_model_architecture
raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'BitNetForCausalLM' not supported!
I could not make it run either; I always get the same error:
python setup_env.py -md ~/.cache/llama.cpp/microsoft_bitnet-b1.58-2B-4T-gguf_ggml-model-i2_s.gguf -q i2_s
Traceback (most recent call last):
File "/home/zezen/Downloads/BitNet/setup_env.py", line 232, in <module>
main()
File "/home/zezen/Downloads/BitNet/setup_env.py", line 208, in main
gen_code()
File "/home/zezen/Downloads/BitNet/setup_env.py", line 188, in gen_code
raise NotImplementedError()
NotImplementedError
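For what it's worth, -md seems to expect the model directory rather than the .gguf file itself (the later, working runs in this thread take that shape); a sketch of the intended invocation:
# Download the official GGUF into a directory named after the model...
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
# ...and point -md at that directory (which should contain ggml-model-i2_s.gguf), not at the file
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s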
Tips and random musings:
Try: cmake --build build --config Release, as the build files were ready but not built yet.
Lots of warnings:
3rdparty/llama.cpp/ggml/src/ggml-cpu/ops.cpp:7:
/home/zezen/Downloads/BitNet/3rdparty/llama.cpp/ggml/src/ggml-cpu/vec.h:412:16: warning: compound literals are a C99-specific feature [-Wc99-extensions]
y[i] = GGML_FP32_TO_FP16(v * fminf(1.0f, fmaxf(0.0f, (v + 3.0f) / 6.0f)));
^
note: expanded from macro 'GGML_FP32_TO_FP16'
#define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
^
yet it compiles fine:
ls build/bin/
libggml-base.so libggml-cpu.so libggml.so libllama.so
ldd build/bin/libggml-base.so
linux-vdso.so.1 (0x00007ffec3fdb000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f72a6522000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f72a6200000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f72a64fe000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f72a5e00000)
/lib64/ld-linux-x86-64.so.2 (0x00007f72a6710000)
and still the same python error, due to:
def setup_gguf():
# Install the pip package
run_command([sys.executable, "-m", "pip", "install", "3rdparty/llama.cpp/gguf-py"], log_step="install_gguf")
def gen_code():
_, arch = system_info()
llama3_f3_models = set([model['model_name'] for model in SUPPORTED_HF_MODELS.values() if model['model_name'].startswith("Falcon3") or model['model_name'].startswith("Llama")])
if arch == "arm64":
if args.use_pretuned:
pretuned_kernels = os.path.join("preset_kernels", get_model_name())
if not os.path.exists(pretuned_kernels):
logging.error(f"Pretuned kernels not found for model {args.hf_repo}")
sys.exit(1)
if args.quant_type == "tl1":
shutil.copyfile(os.path.join(pretuned_kernels, "bitnet-lut-kernels-tl1.h"), "include/bitnet-lut-kernels.h")
shutil.copyfile(os.path.join(pretuned_kernels, "kernel_config_tl1.ini"), "include/kernel_config.ini")
elif args.quant_type == "tl2":
shutil.copyfile(os.path.join(pretuned_kernels, "bitnet-lut-kernels-tl2.h"), "include/bitnet-lut-kernels.h")
shutil.copyfile(os.path.join(pretuned_kernels, "kernel_config_tl2.ini"), "include/kernel_config.ini")
if get_model_name() == "bitnet_b1_58-large":
run_command([sys.executable, "utils/codegen_tl1.py", "--model", "bitnet_b1_58-large", "--BM", "256,128,256", "--BK", "128,64,128", "--bm", "32,64,32"], log_step="codegen")
elif get_model_name() in llama3_f3_models:
run_command([sys.executable, "utils/codegen_tl1.py", "--model", "Llama3-8B-1.58-100B-tokens", "--BM", "256,128,256,128", "--BK", "128,64,128,64", "--bm", "32,64,32,64"], log_step="codegen")
elif get_model_name() == "bitnet_b1_58-3B":
run_command([sys.executable, "utils/codegen_tl1.py", "--model", "bitnet_b1_58-3B", "--BM", "160,320,320", "--BK", "64,128,64", "--bm", "32,64,32"], log_step="codegen")
elif get_model_name() == "BitNet-b1.58-2B-4T":
run_command([sys.executable, "utils/codegen_tl1.py", "--model", "bitnet_b1_58-3B", "--BM", "160,320,320", "--BK", "64,128,64", "--bm", "32,64,32"], log_step="codegen")
else:
raise NotImplementedError()
else:
if args.use_pretuned:
# cp preset_kernels/model_name/bitnet-lut-kernels_tl1.h to include/bitnet-lut-kernels.h
pretuned_kernels = os.path.join("preset_kernels", get_model_name())
if not os.path.exists(pretuned_kernels):
logging.error(f"Pretuned kernels not found for model {args.hf_repo}")
sys.exit(1)
shutil.copyfile(os.path.join(pretuned_kernels, "bitnet-lut-kernels-tl2.h"), "include/bitnet-lut-kernels.h")
if get_model_name() == "bitnet_b1_58-large":
run_command([sys.executable, "utils/codegen_tl2.py", "--model", "bitnet_b1_58-large", "--BM", "256,128,256", "--BK", "96,192,96", "--bm", "32,32,32"], log_step="codegen")
elif get_model_name() in llama3_f3_models:
run_command([sys.executable, "utils/codegen_tl2.py", "--model", "Llama3-8B-1.58-100B-tokens", "--BM", "256,128,256,128", "--BK", "96,96,96,96", "--bm", "32,32,32,32"], log_step="codegen")
elif get_model_name() == "bitnet_b1_58-3B":
run_command([sys.executable, "utils/codegen_tl2.py", "--model", "bitnet_b1_58-3B", "--BM", "160,320,320", "--BK", "96,96,96", "--bm", "32,32,32"], log_step="codegen")
elif get_model_name() == "BitNet-b1.58-2B-4T":
run_command([sys.executable, "utils/codegen_tl2.py", "--model", "bitnet_b1_58-3B", "--BM", "160,320,320", "--BK", "96,96,96", "--bm", "32,32,32"], log_step="codegen")
else:
raise NotImplementedError()
- presumably a model mismatch, as:
/usr/local/bin/llama-cli --version
version: 5172 (eb1776b1)
built with Ubuntu clang version 14.0.0-1ubuntu1.1 for x86_64-pc-linux-gnu
pip show gguf
Name: gguf
Version: 0.16.2
Summary: Read and write ML models in GGUF for GGML
Home-page: https://ggml.ai
Author: GGML
Author-email: [email protected]
License:
Location: /home/zezen/.local/lib/python3.10/site-packages
Requires: numpy, pyyaml, sentencepiece, tqdm
Required-by: llama-cpp-scripts
which (unpatched, I presume) causes:
...
gguf-dump [...] /ggml-model-i2_s.gguf
...
File "/usr/lib/python3.10/enum.py", line 385, in __call__
return cls.__new__(cls, value)
File "/usr/lib/python3.10/enum.py", line 710, in __new__
raise ve_exc
ValueError: np.uint32(36) is not a valid GGMLQuantizationType
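A quick sanity check of which gguf module actually gets imported, and from where, since several copies may be installed side by side (a small sketch):
# Print the resolved module path and installed version of gguf
python -c "import gguf, importlib.metadata as md; print(gguf.__file__); print(md.version('gguf'))"
# Compare against the copy bundled with BitNet
ls 3rdparty/llama.cpp/gguf-py/gguf/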
I suspect that we both have the "regular" llama-cli etc. built separately:
locate libggml-base.so
...
~/.local/lib/python3.10/site-packages/lib/libggml-base.so
~/.local/lib/python3.10/site-packages/llama_cpp/lib/libggml-base.so
~/Downloads/llama.cpp/build/bin/libggml-base.so
[this freshly built one]
/usr/local/lib/libggml-base.so
- so the .so files clash.
Update: I have also pasted my musings into Grok, and it (rightly, methinks) claims that we are using the wrong GGUF files and are not isolating the Python environments.
Grok (trimmed, sorry for this raw paste):
...
Root Causes

Let's dissect the causes based on the error and your observations:

Model Name Mismatch in setup_env.py: The gen_code() function expects the model name to match one of the predefined values. The model file path (~/.cache/llama.cpp/microsoft_bitnet-b1.58-2B-4T-gguf_ggml-model-i2_s.gguf) suggests you're using the BitNet-b1.58-2B-4T model, but get_model_name() (likely parsing the model path or metadata) might not return the expected string (BitNet-b1.58-2B-4T). Possible reasons: the model file's metadata (e.g., GGUF headers) doesn't match what setup_env.py expects; the gguf package or llama.cpp version is outdated, causing incorrect parsing of the model file; the path format or naming convention confuses the script.

GGUF Quantization Type Error: The gguf-dump error indicates that the i2_s quantization type (enum value 36) is not recognized by gguf version 0.16.2. This suggests that the GGUF file uses a newer quantization format introduced in a later version of llama.cpp or BitNet's fork of llama.cpp. The BitNet README specifies using i2_s for quantization [Web:1], but your gguf package or llama.cpp library might not support it due to version mismatches.

...

When setup_env.py runs, Python's dynamic linker (ld) might load the wrong libggml-base.so (e.g., from /usr/local/lib or ~/.local/lib), causing version incompatibilities. This could break GGUF parsing or kernel generation.

Clang Version: BitNet requires Clang 18 for optimal performance and compatibility [Web:3]. Your Clang 14 build succeeded, but the generated libraries might lack specific optimizations or features required by setup_env.py's code generation.

Steps to Resolve the Issue

To fix the NotImplementedError, GGUF quantization error, and library conflicts, follow these steps systematically:

Update the gguf Package: The gguf package version 0.16.2 is outdated and doesn't support the i2_s quantization type (enum 36). Update to the latest version or install BitNet's fork of gguf-py:
pip uninstall gguf
pip install --no-cache-dir -r ~/Downloads/BitNet/requirements.txt
pip install ~/Downloads/BitNet/3rdparty/llama.cpp/gguf-py
Verify the version:
pip show gguf
The latest version (as of April 2025) should be >=0.17.0 or a custom fork from BitNet.

Isolate the Python Environment: To avoid library conflicts, create a clean Conda environment as recommended by the BitNet README [Web:3]:
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r ~/Downloads/BitNet/requirements.txt
pip install ~/Downloads/BitNet/3rdparty/llama.cpp/gguf-py
This ensures setup_env.py uses the correct gguf package and avoids loading incompatible libraries from ~/.local/lib.
Fix Model Path and Naming: The NotImplementedError suggests get_model_name() doesn't recognize the model. The model path (~/.cache/llama.cpp/microsoft_bitnet-b1.58-2B-4T-gguf_ggml-model-i2_s.gguf) is unusual, as it's nested in ~/.cache/llama.cpp and has a non-standard name. Download the model directly to a clean directory:
mkdir -p ~/Downloads/BitNet/models/BitNet-b1.58-2B-4T
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir ~/Downloads/BitNet/models/BitNet-b1.58-2B-4T
Run setup_env.py with the correct path:
python setup_env.py -md ~/Downloads/BitNet/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -q i2_s
If the model name still fails, explicitly specify the Hugging Face repo:
python setup_env.py --hf-repo microsoft/BitNet-b1.58-2B-4T -md ~/Downloads/BitNet/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -q i2_s
Resolve Library Conflicts: To prevent setup_env.py from loading the wrong libggml-base.so, set LD_LIBRARY_PATH to prioritize BitNet's libraries:
export LD_LIBRARY_PATH=~/Downloads/BitNet/build/bin:$LD_LIBRARY_PATH
python setup_env.py -md ~/Downloads/BitNet/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -q i2_s
Alternatively, uninstall conflicting llama-cpp-python packages:
pip uninstall llama-cpp-python llama-cpp-scripts
If you need llama.cpp separately, keep its libraries isolated (e.g., don't install to /usr/local/lib).

Rebuild with Clang 18: Your Clang 14 build works but may lack BitNet-specific optimizations. Install Clang 18 as recommended [Web:3]:
sudo bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
Clean and rebuild:
cd ~/Downloads/BitNet
rm -rf build
cmake -S . -B build -DLLAMA_BUILD_SERVER=ON -DCMAKE_C_COMPILER=clang-18 -DCMAKE_CXX_COMPILER=clang++-18
cmake --build build --config Release
This ensures compatibility with BitNet's optimized kernels.

Verify GGUF File: Check the GGUF file's metadata to confirm it's compatible:
python -m gguf.tools dump ~/Downloads/BitNet/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
If the gguf-dump error persists, the GGUF file may be from a newer llama.cpp version. Update BitNet's llama.cpp submodule:
cd ~/Downloads/BitNet
git submodule update --init --recursive
rm -rf build
cmake -S . -B build -DLLAMA_BUILD_SERVER=ON -DCMAKE_C_COMPILER=clang-18 -DCMAKE_CXX_COMPILER=clang++-18
cmake --build build --config Release

Patch setup_env.py (Temporary Workaround):
[I skip this one ;) along with pages more of interesting and sound general advice]
Are you trying to convert the model by yourself? I'd suggest you download the model from HF and run it directly.
microsoft_bitnet-b1.58-2B-4T-gguf_ggml-model-i2_s.gguf
Please follow the instructions to download the model (python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s). The model name is sensitive: the script checks whether the model exists, and if the name is different it will trigger a conversion, which may fail. Thanks.
Re:
Are you trying to convert the model by yourself? I'd suggest you download the model from HF and run it directly.
Indeed, we may also have been trying to convert the wrong, hand-downloaded models.
Let us retry, almost comme il faut (though still without virtualization, no conda):
/BitNet$ ls
3rdparty assets build CMakeLists.txt CODE_OF_CONDUCT.md docs include LICENSE logs media models preset_kernels README.md requirements.txt run_inference.py SECURITY.md setup_env.py src utils
git pull
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1 (from 1)
Unpacking objects: 100% (1/1), 911 bytes | 911.00 KiB/s, done.
From https://github.com/microsoft/BitNet
fd9f1d6..c17d1c5 main -> origin/main
Updating fd9f1d6..c17d1c5
Fast-forward
3rdparty/llama.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]Downloading 'README.md' to 'models/BitNet-b1.58-2B-4T/.cache/huggingface/download/README.md.c4a6897e03fbb0320ded5e0b686d8a5e1968154c.incomplete'
Downloading '.gitattributes' to 'models/BitNet-b1.58-2B-4T/.cache/huggingface/download/.gitattributes.4e3e1a539c8d36087c5f8435e653b7dc694a0da6.incomplete'
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.96k/8.96k [00:00<00:00, 62.4MB/s]
Download complete. Moving file to models/BitNet-b1.58-2B-4T/README.md | 0.00/8.96k [00:00<?, ?B/s]
.gitattributes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.64k/1.64k [00:00<00:00, 17.5MB/s]
Download complete. Moving file to models/BitNet-b1.58-2B-4T/.gitattributes | 0.00/1.64k [00:00<?, ?B/s]
Fetching 3 files: 33%|████████████████████████████████████████████████████▋ | 1/3 [00:01<00:02, 1.07s/it]
Fetching 3 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:05<00:00, 21.80s/it]
... /BitNet/models/BitNet-b1.58-2B-4T
ls ... /test/BitNet/models/BitNet-b1.58-2B-4T
ggml-model-i2_s.gguf README.md
So far so good as:
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
INFO:root:Compiling the code using CMake.
INFO:root:Loading model from directory models/BitNet-b1.58-2B-4T.
INFO:root:GGUF model already exists at models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
But:
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
INFO:root:Compiling the code using CMake.
INFO:root:Loading model from directory models/BitNet-b1.58-2B-4T.
INFO:root:GGUF model already exists at models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
and the session freezes.
Quick debugging, over my morning coffee:
build/bin/llama-cli --version
version: 3955 (a8ac7072)
built with Ubuntu clang version 14.0.0-1ubuntu1.1 for x86_64-pc-linux-gnu
- so it finally got patched, versus:
/usr/local/bin/llama-cli --version
version: 5172 (eb1776b1)
built with Ubuntu clang version 14.0.0-1ubuntu1.1 for x86_64-pc-linux-gnu
Let us try it by hand:
ls build/bin/llama-cli
build/bin/llama-cli
so only this one can be used to test the right gguf.
For the sake of it, let us try the "old" gguf dumper (gguf-dump models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf) before a potential new session freeze:
.local/lib/python3.10/site-packages/gguf/gguf_reader.py", line 151, in _get
.newbyteorder(override_order or self.byte_order)
AttributeError: `newbyteorder` was removed from the ndarray class in NumPy 2.0. Use `arr.view(arr.dtype.newbyteorder(order))` instead
- a promising new error this time, so maybe something got patched in the previous step.
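Two plausible ways out of that particular AttributeError, assuming it really is just an older gguf release tripping over NumPy 2.0 (a sketch, not verified here):
# Option A: install the gguf-py bundled with BitNet over the stale one
pip install 3rdparty/llama.cpp/gguf-py
# Option B: keep the existing gguf but pin NumPy below 2.0
pip install "numpy<2"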
Let us try the command (somewhat stripped down) by hand:
build/bin/llama-cli -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -n 128 -t 2 -p 'You are a helpful assistant' -cnv
build: 3955 (a8ac7072) with Ubuntu clang version 14.0.0-1ubuntu1.1 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
Segmentation fault
stracing it:
openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
getrandom("\xf7\x69\x75\x64\xa4\xbb\x33\x21", 8, GRND_NONBLOCK) = 8
brk(NULL) = 0x5584203a3000
brk(0x5584203c4000) = 0x5584203c4000
getdents64(3, 0x5584203a32d0 /* 26 entries */, 32768) = 752
getdents64(3, 0x5584203a32d0 /* 0 entries */, 32768) = 0
close(3) = 0
sched_getaffinity(99734, 8, [0, 1, 2, 3, 4, 5, 6, 7]) = 8
futex(0x7f4ba56697fc, FUTEX_WAKE_PRIVATE, 2147483647) = 0
brk(0x5584203e5000) = 0x5584203e5000
openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(3, "processor\t: 0\nvendor_id\t: Genuin"..., 1024) = 1024
close(3) = 0
brk(0x558420406000) = 0x558420406000
rt_sigaction(SIGRT_1, {sa_handler=0x7f4ba5091870, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7f4ba5042520}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4ba47ff000
mprotect(0x7f4ba4800000, 8388608, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7f4ba4fff910, parent_tid=0x7f4ba4fff910, exit_signal=0, stack=0x7f4ba47ff000, stack_size=0x7ffe80, tls=0x7f4ba4fff640} => {parent_tid=[99735]}, 88) = 99735
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0x55841ecae090, FUTEX_WAKE_PRIVATE, 1) = 1
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, TCGETSbuild: 3955 (a8ac7072) with Ubuntu clang version 14.0.0-1ubuntu1.1 for x86_64-pc-linux-gnu
, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo ...}) = 0
ioctl(0, TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
openat(AT_FDCWD, "/dev/tty", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 4
newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=226915712, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 226915712, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7f4b92600000
close(4) = 0
futex(0x55841ecae0e8, FUTEX_WAKE_PRIVATE, 1) = 1
main: llama backend init
main: load the model and apply lora adapter, if any
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x100000000} ---
+++ killed by SIGSEGV +++
Segmentation fault
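strace shows where the process died but not in which library; a backtrace usually points at the offending frame faster (a sketch, assuming gdb is installed):
# Re-run under gdb and print a backtrace at the SIGSEGV
gdb -q -batch -ex run -ex bt --args build/bin/llama-cli -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -n 16 -t 2 -p 'test'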
Interestinger and interestinger, as it all runs on an almost vanilla setup:
uname -a
Linux above-hp2-silver 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
My interim guess (thanks to Grok, too) is a library mismatch: the llama-cli binary is using incompatible libraries (libllama.so, libggml.so) from /usr/local/lib, which come from llama.cpp (version 5172) and don't support BitNet's i2_s quantization or model format, as:
ldd build/bin/llama-cli
linux-vdso.so.1 (0x00007fff3f7f9000)
libllama.so => /usr/local/lib/libllama.so (0x00007fbb2c97c000)
libggml.so => /usr/local/lib/libggml.so (0x00007fbb2c96f000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fbb2c600000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbb2c519000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fbb2c91f000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbb2c200000)
libggml-base.so => /usr/local/lib/libggml-base.so (0x00007fbb2c445000)
/lib64/ld-linux-x86-64.so.2 (0x00007fbb2cbc8000)
libggml-cpu.so => /usr/local/lib/libggml-cpu.so (0x00007fbb2c144000)
libggml-rpc.so => /usr/local/lib/libggml-rpc.so (0x00007fbb2c90b000)
libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fbb2c8b5000)
- I shall test it all on a clean machine soon.
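In the meantime, a quick way to test the mismatch hypothesis without touching /usr/local/lib: check what the loader offers system-wide, and whether pointing LD_LIBRARY_PATH at the BitNet build tree changes what llama-cli resolves (a sketch; see the export that eventually worked further down):
# What the system loader currently knows about
ldconfig -p | grep -E 'libllama|libggml'
# Whether the freshly built libraries win once LD_LIBRARY_PATH prefers them
LD_LIBRARY_PATH=$(pwd)/build/3rdparty/llama.cpp/src:$(pwd)/build/3rdparty/llama.cpp/ggml/src ldd build/bin/llama-cli | grep -E 'libllama|libggml'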
microsoft_bitnet-b1.58-2B-4T-gguf_ggml-model-i2_s.gguf
Please be more specific.
I wish to run it with llama-server, or in any other way that keeps it loaded in memory.
I have this model
/mnt/nvme0n1/LLM/git/BitNet/models/microsoft_bitnet-b1.58-2B-4T-gguf_ggml-model-i2_s.gguf
I got this one too:
/mnt/nvme0n1/LLM/git/BitNet/models/ggml-model-i2_s.gguf
I have this model: /mnt/nvme0n1/LLM/git/BitNet/models/BitNet-b1.58-2B-4T
In this directory: /mnt/nvme0n1/LLM/git/BitNet
I try this:
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
INFO:root:Compiling the code using CMake.
INFO:root:Loading model from directory models/BitNet-b1.58-2B-4T.
INFO:root:Converting HF model to GGUF format...
ERROR:root:Error occurred while running command: Command '['/mnt/nvme0n1/LLM/git/BitNet/bitnet-cpp/bin/python', 'utils/convert-hf-to-gguf-bitnet.py', 'models/BitNet-b1.58-2B-4T', '--outtype', 'f32']' returned non-zero exit status 1., check details in logs/convert_to_f32_gguf.log
What should I do to make it run?
microsoft_bitnet-b1.58-2B-4T-gguf_ggml-model-i2_s.gguf
Please follow the instructions to download the model (python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s). The model name is sensitive: the script checks whether the model exists, and if the name is different it will trigger a conversion, which may fail. Thanks.
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
INFO:root:Compiling the code using CMake.
INFO:root:Loading model from directory models/BitNet-b1.58-2B-4T.
INFO:root:GGUF model already exists at models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
I got it this far.
But then here is the problem:
./build/bin/llama-server -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
error:
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'bitnet-b1.58'
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model 'models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf'
srv load_model: failed to load model, 'models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf'
main: exiting due to model loading error
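One way to see which architecture string the file on disk actually carries, and hence whether it predates the recent re-upload (a sketch, assuming a gguf-py recent enough to parse i2_s):
# Look at the declared architecture and name in the GGUF metadata
gguf-dump models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf | grep -iE 'general\.(architecture|name)'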
@gnusupport seems relevant https://github.com/microsoft/BitNet/issues/226
Are you loading the model from https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf ? There was an update to this model recently, and you need to git pull --recurse-submodules to pull the latest code changes.
Re: git pull --recurse-submodules - that newer version would be the most elegant route if it worked with https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf, which has puzzled us all, without virtualization or holding the left ear with the right hand, all backwards.
FYI, on Unix:
git pull --recurse-submodules
Fetching submodule 3rdparty/llama.cpp
From https://github.com/ggml-org/llama.cpp
eb1776b1..558a7647 master -> origin/master
* [new branch] gg/embeddings-no-kv -> origin/gg/embeddings-no-kv
+ c078fd00...dec80ace gg/llama-kv-cache-v6 -> origin/gg/llama-kv-cache-v6 (forced update)
* [new branch] gg/model-cards -> origin/gg/model-cards
* [new branch] jg/llama-opt-3 -> origin/jg/llama-opt-3
+ 70b46910...beed9b38 sycl/unary_all -> origin/sycl/unary_all (forced update)
* [new tag] b5190 -> b5190
* [new tag] b5173 -> b5173
* [new tag] b5174 -> b5174
* [new tag] b5175 -> b5175
* [new tag] b5176 -> b5176
* [new tag] b5177 -> b5177
* [new tag] b5178 -> b5178
* [new tag] b5180 -> b5180
* [new tag] b5181 -> b5181
* [new tag] b5184 -> b5184
* [new tag] b5185 -> b5185
* [new tag] b5186 -> b5186
* [new tag] b5187 -> b5187
* [new tag] b5188 -> b5188
* [new tag] b5189 -> b5189
Could not access submodule 'ggml/src/ggml-kompute/kompute'
Errors during submodule fetch:
3rdparty/llama.cpp
so:
cd ..
mv BitNet Bitnet.2
and downloading from scratch seems to be needed:
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
Not a biggie - testing it now...
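(For the record, a re-clone is the surer path, but this sometimes rescues a wedged submodule checkout in place; a sketch:)
# Re-sync submodule URLs and force the checkout to the recorded commits
git submodule sync --recursive
git submodule update --init --recursive --force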
Done git cloning anew.
Instead of any manual cmake (which does not work), I am trying, as per the RTFM: huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T and then
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
INFO:root:Compiling the code using CMake.
INFO:root:Loading model from directory models/BitNet-b1.58-2B-4T.
INFO:root:GGUF model already exists at models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
It compiles, seemingly both on Ubuntu and in Termux.
But déjà vu:
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
build: 3957 (5eb47b72) with Ubuntu clang version 14.0.0-1ubuntu1.1 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
Error occurred while running command: Command '['build/bin/llama-cli', '-m', 'models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf', '-n', '128', '-t', '2', '-p', 'You are a helpful assistant', '-ngl', '0', '-c', '2048', '--temp', '0.8', '-b', '1', '-cnv']' died with <Signals.SIGSEGV: 11>.
so I shall try on another virgin machine, after all.
try https://github.com/microsoft/BitNet/pull/204 please, should work for local use
Thank you Benjamin.
This is what I did:
$ git fetch origin pull/204/head:pr-branch-name
$ git checkout pr-branch-name
$ git checkout main
$ git merge pr-branch-name
I have already downloaded the model: models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
and then
$ python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
and then it worked with:
$ python run_inference_server.py --model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf --prompt "You are assistant" --n-predict 4096 --threads 2 --ctx-size 2048 --temperature 0.8 --host 127.0.0.1 --port 8080
so please close the issue if it worked for you
That comment: https://github.com/microsoft/BitNet/issues/206#issuecomment-2832035916 helped - it compiles. After a couple of git reset --hard origin/main-type operations, however, déjà vu:
build/bin/llama-server -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -c 2048 -t 2 -n 4096 --host 127.0.0.1 --port 8080 -cb -p 'You are assistant'
build: 3955 (a8ac7072) with Ubuntu clang version 14.0.0-1ubuntu1.1 for x86_64-pc-linux-gnu
system info: n_threads = 2, n_threads_batch = 2, total_threads = 8
system_info: n_threads = 2 (n_threads_batch = 2) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 7
main: loading model
Segmentation fault
After a chat with an AI or two about:
ldd build/bin/llama-server
linux-vdso.so.1 (0x00007fffffdf8000)
libllama.so => /usr/local/lib/libllama.so (0x00007ffb916ab000)
libggml.so => /usr/local/lib/libggml.so (0x00007ffb9169e000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ffb91400000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ffb91319000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ffb912f5000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffb91000000)
libggml-base.so => /usr/local/lib/libggml-base.so (0x00007ffb90f2c000)
/lib64/ld-linux-x86-64.so.2 (0x00007ffb919c6000)
libggml-cpu.so => /usr/local/lib/libggml-cpu.so (0x00007ffb91239000)
libggml-rpc.so => /usr/local/lib/libggml-rpc.so (0x00007ffb90f18000)
libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007ffb90ec4000)
(yes, on that old machine)
I have settled on:
export LD_LIBRARY_PATH=$(pwd)/build/3rdparty/llama.cpp/src/:$(pwd)/build/3rdparty/llama.cpp/ggml/src/:$LD_LIBRARY_PATH
to avoid the old .so file clashes.
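Since that export is easy to forget between sessions, a throwaway wrapper keeps it in one place (a sketch; the script name and server flags are just an example):
#!/usr/bin/env bash
# run_server.sh - hypothetical wrapper: prefer the freshly built llama.cpp libs, then start the server
set -e
cd "$(dirname "$0")"
export LD_LIBRARY_PATH="$(pwd)/build/3rdparty/llama.cpp/src:$(pwd)/build/3rdparty/llama.cpp/ggml/src:$LD_LIBRARY_PATH"
exec build/bin/llama-server -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -c 2048 -t 2 --host 127.0.0.1 --port 8080 "$@"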
Only now it works on that other gguf:
build/bin/llama-server -m /mnt/HP_P7_Data/Temp/GPT4All_DBs/Bitnet_MS/ggml-model-i2_s.gguf -c 2048 -t 2 -n 4096 --host 127.0.0.1 --port 8080 -cb -p 'You are assistant'
build: 3955 (a8ac7072) with Ubuntu clang version 14.0.0-1ubuntu1.1 for x86_64-pc-linux-gnu
system info: n_threads = 2, n_threads_batch = 2, total_threads = 8
system_info: n_threads = 2 (n_threads_batch = 2) / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 7
main: loading model
llama_model_loader: loaded meta data with 24 key-value pairs and 333 tensors from /mnt/HP_P7_Data/Temp/GPT4All_DBs/Bitnet_MS/ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bitnet-25
llama_model_loader: - kv 1: general.name str = bitnet2b_2501
llama_model_loader: - kv 2: bitnet-25.vocab_size u32 = 128256
llama_model_loader: - kv 3: bitnet-25.context_length u32 = 4096
llama_model_loader: - kv 4: bitnet-25.embedding_length u32 = 2560
llama_model_loader: - kv 5: bitnet-25.block_count u32 = 30
llama_model_loader: - kv 6: bitnet-25.feed_forward_length u32 = 6912
llama_model_loader: - kv 7: bitnet-25.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: bitnet-25.attention.head_count u32 = 20
llama_model_loader: - kv 9: bitnet-25.attention.head_count_kv u32 = 5
llama_model_loader: - kv 10: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 11: bitnet-25.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: bitnet-25.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 13: general.file_type u32 = 40
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 128001
llama_model_loader: - kv 22: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type f16: 2 tensors
llama_model_loader: - type i2_s: 210 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
Thanks.
Ver. 1.2
Update (a note to myself, mostly): doing the above on Droid this time. I have decoded what that setup_env.py file is doing. All the log output is hidden by a hard pipe to the log files, with no tee to the terminal, needlessly. So I am trying this instead: cmake sped up by a factor of 8 or more, and debugging is finally possible:
# Create log directory
mkdir -p logs
# 1. setup_gguf(): Install gguf-py package
python -m pip install 3rdparty/llama.cpp/gguf-py
# 2. gen_code(): Generate code for BitNet-b1.58-2B-4T with i2_s quantization
python utils/codegen_tl1.py --model bitnet_b1_58-3B --BM 160,320,320 --BK 64,128,64 --bm 32,64,32
# 3. compile(): Compile the codebase with CMake
cmake -B build -DBITNET_ARM_TL1=ON
cmake --build build --config Release --parallel 8
# 4. prepare_model(): Convert model to GGUF format (i2_s quantization)
# Check if model directory exists (assumed to exist as per -md argument)
# Convert to f32 GGUF
python utils/convert-hf-to-gguf-bitnet.py models/BitNet-b1.58-2B-4T --outtype f32
# Quantize f32 to i2_s
./build/bin/llama-quantize models/BitNet-b1.58-2B-4T/ggml-model-f32.gguf models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf I2_S 1
....
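And to keep the convert/quantize output visible on the terminal while still writing the same log file that setup_env.py uses, tee can replace the hard redirect (a sketch):
# Same conversion step, but mirrored to the terminal as well as the log
python utils/convert-hf-to-gguf-bitnet.py models/BitNet-b1.58-2B-4T --outtype f32 2>&1 | tee logs/convert_to_f32_gguf.log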
FYI, after experimenting with the above to clear up some PKG_CONFIG- and LD_PATH-like weirdnesses in my prooted Debian, it finally compiles and runs there, even the usual way. Yippee!
Re the needed export LD_LIBRARY_PATH=$(pwd)/build/3rdparty/llama.cpp/src/:$(pwd)/build/3rdparty/llama.cpp/ggml/src/:$LD_LIBRARY_PATH trick, so that it does not run into a segmentation fault: it works, but I forget about it now and then -
ChatGPT suggests that it all should have been compiled with rpath, e.g.
✅ Modify CMakeLists.txt (Top Level) Place this after setting the output directories (CMAKE_RUNTIME_OUTPUT_DIRECTORY, etc.):
# Use $ORIGIN so binaries find libs relative to their location
set(CMAKE_SKIP_BUILD_RPATH FALSE)
set(CMAKE_BUILD_WITH_INSTALL_RPATH FALSE)
set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)
set(CMAKE_INSTALL_RPATH "$ORIGIN/../lib") # or $ORIGIN if libs are next to binaries
💡 Explanation: $ORIGIN = path of the binary at runtime
../lib = if your .so files are in a lib/ folder next to the bin/ folder
Ensures libraries are found without setting LD_LIBRARY_PATH
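If such an rpath change goes in, it is easy to verify on a fresh build that it took effect (a sketch):
# The binary should now carry an RPATH/RUNPATH entry and resolve the locally built libraries
readelf -d build/bin/llama-server | grep -E 'RPATH|RUNPATH'
ldd build/bin/llama-server | grep -E 'libllama|libggml'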
The maintainers may consider adding some version thereof.
Ver. 1.2