
Compilation issue on macOS - setup_env.py

Open ryohajika opened this issue 8 months ago • 16 comments

As other open issues (#158 and #180) mention, the compilation process stalls after ggml-bitnet-lut.cpp, as logs/compile.log shows. I'm trying to build the BitNet project on macOS 15.4.1, and I have tried building with both the default gcc & g++ option (Command Line Tools and Xcode) and the clang-20 option (installed via Homebrew, bundled with the llvm package), with no luck getting the job done. Activity Monitor shows that clang is still active and occupying 100% of a CPU core even after compile.log stops updating, and no error log or message ever comes out. Perhaps the Python script or the compiler is falling into an infinite loop after compiling ggml-bitnet-lut.cpp?

ryohajika avatar Apr 21 '25 15:04 ryohajika

I get the same errors. I tried the default C/C++ compilers provided by Xcode 16.3 (clang version 17) as well as llvm 18, 19, and 20. With the default Xcode compilers and llvm 19 & 20 I get stuck at ggml-bitnet-lut.cpp, while with llvm 18 I get an error because of an incompatibility with a library framework.

salauioan avatar Apr 21 '25 16:04 salauioan

I was able to get it to work using pixi with this pyproject.toml; put it in the repo root:

[project]
dependencies = [
    "gguf>=0.1.0",
    "numpy~=1.26.4",
    "protobuf>=4.21.0,<5.0.0",
    "sentencepiece~=0.2.0",
    "torch~=2.2.1",
    "transformers>=4.46.3,<5.0.0",
]
name = "BitNet"
requires-python = ">= 3.9"
version = "0.1.0"

[build-system]
build-backend = "hatchling.build"
requires = ["hatchling"]

[tool.pixi.workspace]
channels = ["conda-forge"]
platforms = ["osx-arm64"]

[tool.pixi.pypi-dependencies]
bitnet = { path = ".", editable = true }

[tool.pixi.dependencies]
cmake = ">=4.0.1,<5"
python = ">=3.9,<3.12"
uv = ">=0.6.14,<0.7"
pip = ">=25.0.1,<26"
cxx-compiler = ">=1.9.0,<2"

From there, run pixi shell, and then you can run the commands from the README:

huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv

toffaletti avatar Apr 21 '25 23:04 toffaletti

Thanks so much @toffaletti for your suggestion! Long story short, it worked!! I will leave a note below for the others...

I did that within the BitNet directory and at first got the following error:

      ValueError: Unable to determine which files to ship inside the wheel using the following heuristics: https://hatch.pypa.io/latest/plugins/builder/
      wheel/#default-file-selection

      The most likely cause of this is that there is no directory that matches the name of your project (BitNet or bitnet).

      At least one file selection option must be defined in the `tool.hatch.build.targets.wheel` table, see: https://hatch.pypa.io/latest/config/build/

      As an example, if you intend to ship a directory named `foo` that resides within a `src` directory located at the root of your project, you can
      define the following:

      [tool.hatch.build.targets.wheel]
      packages = ["src/foo"]

      hint: This usually indicates a problem with the package or the build environment.

I don't know if this is the right thing to do, but I added the following at the end of the pyproject.toml file:

[tool.hatch.build.targets.wheel]
packages = [".pixi/wheel"]

Then I ran pixi shell again, and the whole thing worked!!

ryohajika avatar Apr 22 '25 12:04 ryohajika

@ryohajika - what compiler are you using, the one shipped with Xcode or an llvm-based clang (what exact version)? I followed the steps suggested by you and @toffaletti, but I still get stuck at the same compilation step:

[ 6%] Building CXX object 3rdparty/llama.cpp/ggml/src/CMakeFiles/ggml.dir/__/__/__/__/src/ggml-bitnet-lut.cpp.o

salauioan avatar Apr 22 '25 13:04 salauioan

I am having the same problem, both with the default clang and clang 20 from Homebrew. It looks like something in BitNet manages to get clang to hang in an infinite loop, and the proposed solution above looks more like a lucky workaround than a fix.

I tried starting cmake manually, and can confirm that it is clang that hangs. If I configure the cmake build without BITNET_ARM_TL1=ON, I can get the build to complete.

jacobgorm avatar Apr 22 '25 13:04 jacobgorm

It looks like a problem with the clang optimizer: if I change -O3 to -O0 in the executed command:

/opt/homebrew/Cellar/llvm/20.1.3/bin/clang++ -DACCELERATE_LAPACK_ILP64 -DACCELERATE_NEW_LAPACK -DGGML_BITNET_ARM_TL1 -DGGML_BUILD -DGGML_METAL_EMBED_LIBRARY -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DGGML_USE_LLAMAFILE -DGGML_USE_METAL -D_DARWIN_C_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/Users/jhansen/dev/BitNet/3rdparty/llama.cpp/ggml/src/../include -I/Users/jhansen/dev/BitNet/3rdparty/llama.cpp/ggml/src/../../../../include -I/Users/jhansen/dev/BitNet/3rdparty/llama.cpp/ggml/src/. -F/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks -O0 -DNDEBUG -std=gnu++11 -arch arm64 -fPIC -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -MD -MT 3rdparty/llama.cpp/ggml/src/CMakeFiles/ggml.dir/__/__/__/__/src/ggml-bitnet-lut.cpp.o -MF 3rdparty/llama.cpp/ggml/src/CMakeFiles/ggml.dir/__/__/__/__/src/ggml-bitnet-lut.cpp.o.d -o 3rdparty/llama.cpp/ggml/src/CMakeFiles/ggml.dir/__/__/__/__/src/ggml-bitnet-lut.cpp.o -c /Users/jhansen/dev/BitNet/src/ggml-bitnet-lut.cpp

It completes.

jacobgorm avatar Apr 22 '25 13:04 jacobgorm

@ryohajika - what compiler are you using, the one shipped with Xcode or an llvm-based clang (what exact version)? I followed the steps suggested by you and @toffaletti, but I still get stuck at the same compilation step:

[ 6%] Building CXX object 3rdparty/llama.cpp/ggml/src/CMakeFiles/ggml.dir/__/__/__/__/src/ggml-bitnet-lut.cpp.o

The reason pixi works is that it is creating a shell environment with python, cmake, and clang 18 installed from conda forge.

I forgot to mention, you need to do this on a clean clone of the repo. If you have previously run the scripts you will have cmake files in the build directory and files in other places that still reference things outside the environment pixi is creating with the specific versions needed to make this work.

Once you're inside the pixi shell, you can use "which clang++" and/or "clang --version" to ensure you have the right version and that it isn't the one from Xcode. You should also see a bunch of environment variables set like CC, configuring the compilers with the right isysroot, linker paths, etc.
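The same sanity check can be scripted if that's more convenient. This is just a sketch: CC, CXX, and SDKROOT are variables conda-forge compiler packages typically export on macOS, not anything BitNet itself defines.

```python
import os
import shutil

def toolchain_report(names=("CC", "CXX", "SDKROOT")):
    """Report which clang++ is first on PATH plus a few compiler env vars."""
    return {
        "clang++": shutil.which("clang++"),
        **{name: os.environ.get(name) for name in names},
    }

# Inside `pixi shell`, clang++ should resolve to a path under .pixi/envs,
# not /usr/bin or the Xcode toolchain.
print(toolchain_report())
```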

toffaletti avatar Apr 22 '25 13:04 toffaletti

The problem with the optimizer seems to be with this generated code:

template<int K>
void preprocessor_k(void* B, void* LUT_Scales, void* QLUT) {
  partial_max_reset((&(((bitnet_float_type*)LUT_Scales)[0])));
  per_tensor_quant(K, (&(((bitnet_float_type*)LUT_Scales)[0])), (&(((bitnet_float_type*)B)[0])));

  lut_ctor<K>((&(((int8_t*)QLUT)[0])), (&(((bitnet_float_type*)B)[0])), (&(((bitnet_float_type*)LUT_Scales)[0])));
}
void ggml_preprocessor(int m, int k, void* B, void* LUT_Scales, void* QLUT) {
    if (m == 3200 && k == 8640) {
        preprocessor_k<8640>(B, LUT_Scales, QLUT);
    }
    else if (m == 3200 && k == 3200) {
        preprocessor_k<3200>(B, LUT_Scales, QLUT);
    }
    else if (m == 8640 && k == 3200) {
        preprocessor_k<3200>(B, LUT_Scales, QLUT);
    }
}

If I comment out the body of ggml_preprocessor the build completes even with optimizations enabled.

jacobgorm avatar Apr 22 '25 14:04 jacobgorm

In fact, it is the #pragma unrolls in:

template<int act_k>
inline void lut_ctor(int8_t* qlut, bitnet_float_type* b, bitnet_float_type* lut_scales) {
#ifdef __ARM_NEON
    int16x8_t vec_lut[16];
    float32_t scales = *lut_scales;
        uint8_t tbl_mask[16];
        tbl_mask[0] = 0;
        tbl_mask[1] = 2;
        tbl_mask[2] = 4;
        tbl_mask[3] = 6;
        tbl_mask[4] = 8;
        tbl_mask[5] = 10;
        tbl_mask[6] = 12;
        tbl_mask[7] = 14;
        tbl_mask[8] = 1;
        tbl_mask[9] = 3;
        tbl_mask[10] = 5;
        tbl_mask[11] = 7;
        tbl_mask[12] = 9;
        tbl_mask[13] = 11;
        tbl_mask[14] = 13;
        tbl_mask[15] = 15;
        uint8x16_t tbl_mask_q = vld1q_u8(tbl_mask);
#pragma unroll
    for (int k = 0; k < act_k / 16; ++k) {
        float32x4x2_t vec_bs_x0 = vld2q_f32(b + k * 16);
        float32x4x2_t vec_bs_x1 = vld2q_f32(b + k * 16 + 8);
        float32x4_t vec_f_0 = vmulq_n_f32(vec_bs_x0.val[0], scales);
        float32x4_t vec_f_1 = vmulq_n_f32(vec_bs_x0.val[1], scales);
        float32x4_t vec_f_2 = vmulq_n_f32(vec_bs_x1.val[0], scales);
        float32x4_t vec_f_3 = vmulq_n_f32(vec_bs_x1.val[1], scales);
        int32x4_t vec_b_0 = vcvtnq_s32_f32(vec_f_0);
        int32x4_t vec_b_1 = vcvtnq_s32_f32(vec_f_1);
        int32x4_t vec_b_2 = vcvtnq_s32_f32(vec_f_2);
        int32x4_t vec_b_3 = vcvtnq_s32_f32(vec_f_3);
        int16x4_t vec_b16_0 = vmovn_s32(vec_b_0);
        int16x4_t vec_b16_1 = vmovn_s32(vec_b_1);
        int16x4_t vec_b16_2 = vmovn_s32(vec_b_2);
        int16x4_t vec_b16_3 = vmovn_s32(vec_b_3);
        int16x8_t vec_bs_0 = vcombine_s16(vec_b16_0, vec_b16_2);
        int16x8_t vec_bs_1 = vcombine_s16(vec_b16_1, vec_b16_3);
        vec_lut[0] = vdupq_n_s16(0);
        vec_lut[0] = vec_lut[0] - vec_bs_0;
        vec_lut[0] = vec_lut[0] - vec_bs_1;
        vec_lut[1] = vdupq_n_s16(0);
        vec_lut[1] = vec_lut[1] - vec_bs_0;
        vec_lut[2] = vdupq_n_s16(0);
        vec_lut[2] = vec_lut[2] - vec_bs_0;
        vec_lut[2] = vec_lut[2] + vec_bs_1;
        vec_lut[3] = vdupq_n_s16(0);
        vec_lut[3] = vec_lut[3] - vec_bs_1;
        vec_lut[4] = vdupq_n_s16(0);
        vec_lut[5] = vec_bs_1;
        vec_lut[6] = vec_bs_0;
        vec_lut[6] = vec_lut[6] - vec_bs_1;
        vec_lut[7] = vec_bs_0;
        vec_lut[8] = vec_bs_0;
        vec_lut[8] = vec_lut[8] + vec_bs_1;
        Transpose_8_8(&(vec_lut[0]), &(vec_lut[1]), &(vec_lut[2]), &(vec_lut[3]),
                      &(vec_lut[4]), &(vec_lut[5]), &(vec_lut[6]), &(vec_lut[7]));
        Transpose_8_8(&(vec_lut[8]), &(vec_lut[9]), &(vec_lut[10]), &(vec_lut[11]),
                      &(vec_lut[12]), &(vec_lut[13]), &(vec_lut[14]), &(vec_lut[15]));
#pragma unroll
        for (int idx = 0; idx < 8; idx++) {
            int8x16_t q0_s = vqtbl1q_s8(vreinterpretq_s8_s16(vec_lut[idx]), tbl_mask_q);
            int8x8_t q0_low = vget_low_s8(q0_s);
            int8x8_t q0_high = vget_high_s8(q0_s);
            int8x16_t q1_s = vqtbl1q_s8(vreinterpretq_s8_s16(vec_lut[idx + 8]), tbl_mask_q);
            int8x8_t q1_low = vget_low_s8(q1_s);
            int8x8_t q1_high = vget_high_s8(q1_s);
            vst1_s8(qlut + k * 16 * 8 * 2 + idx * 16 * 2, q0_high);
            vst1_s8(qlut + k * 16 * 8 * 2 + idx * 16 * 2 + 8, q1_high);
            vst1_s8(qlut + k * 16 * 8 * 2 + idx * 16 * 2 + 16, q0_low);
            vst1_s8(qlut + k * 16 * 8 * 2 + idx * 16 * 2 + 24, q1_low);
        }
    }
#endif
}

that are causing the optimizer to hang. If I remove them, compilation completes with -O3.

Btw I found the culprit with the help of the

-Rpass=.* -Rpass-missed=.* -Rpass-analysis=.*

clang command-line options.

jacobgorm avatar Apr 22 '25 14:04 jacobgorm

Perhaps not surprising that the compiler would get into a bit of trouble unrolling a double-nested loop with k=8640 times 8 sub-loops? Possibly this works on clang 18 because it has the sense to ignore the unroll pragmas.
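For a rough sense of scale (my own arithmetic, not from the BitNet sources): with act_k = 8640, the outer loop runs 8640 / 16 = 540 iterations, each containing an 8-iteration inner loop, so honoring both pragmas literally asks clang to emit on the order of 4,320 copies of the inner body.

```python
# Back-of-the-envelope unroll cost for lut_ctor<8640> (assumption:
# clang honors both #pragma unroll directives literally).
act_k = 8640                         # template parameter for the largest shape
outer_iters = act_k // 16            # trip count of the outer k loop
inner_iters = 8                      # trip count of the inner idx loop
unrolled_bodies = outer_iters * inner_iters
print(outer_iters, unrolled_bodies)  # 540 4320
```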

jacobgorm avatar Apr 22 '25 14:04 jacobgorm

Here is a diff to disable the unrolls during codegen:

commit c4fa1193af49d35ddf4b069d451d6a37d7113496 (HEAD -> main)
Author: Jacob Gorm Hansen <[email protected]>
Date:   Tue Apr 22 16:13:28 2025 +0200

    disable codegen large pragma unrolls

diff --git a/utils/codegen_tl1.py b/utils/codegen_tl1.py
index 4c2e7dd..bdc9ff7 100644
--- a/utils/codegen_tl1.py
+++ b/utils/codegen_tl1.py
@@ -120,7 +120,7 @@ inline void lut_ctor(int8_t* qlut, bitnet_float_type* b, bitnet_float_type* lut_
         tbl_mask[14] = 13;\n\
         tbl_mask[15] = 15;\n\
         uint8x16_t tbl_mask_q = vld1q_u8(tbl_mask);\n\
-#pragma unroll\n\
+//#pragma unroll\n\
     for (int k = 0; k < act_k / 16; ++k) {{\n\
         float32x4x2_t vec_bs_x0 = vld2q_f32(b + k * 16);\n\
         float32x4x2_t vec_bs_x1 = vld2q_f32(b + k * 16 + 8);\n\
@@ -159,7 +159,7 @@ inline void lut_ctor(int8_t* qlut, bitnet_float_type* b, bitnet_float_type* lut_
                       &(vec_lut[4]), &(vec_lut[5]), &(vec_lut[6]), &(vec_lut[7]));\n\
         Transpose_8_8(&(vec_lut[8]), &(vec_lut[9]), &(vec_lut[10]), &(vec_lut[11]),\n\
                       &(vec_lut[12]), &(vec_lut[13]), &(vec_lut[14]), &(vec_lut[15]));\n\
-#pragma unroll\n\
+//#pragma unroll\n\
         for (int idx = 0; idx < 8; idx++) {{\n\
             int8x16_t q0_s = vqtbl1q_s8(vreinterpretq_s8_s16(vec_lut[idx]), tbl_mask_q);\n\
             int8x8_t q0_low = vget_low_s8(q0_s);\n\

jacobgorm avatar Apr 22 '25 14:04 jacobgorm

@ryohajika - what compiler are you using, the one shipped with Xcode or an llvm-based clang (what exact version)? I followed the steps suggested by you and @toffaletti, but I still get stuck at the same compilation step: [ 6%] Building CXX object 3rdparty/llama.cpp/ggml/src/CMakeFiles/ggml.dir/__/__/__/__/src/ggml-bitnet-lut.cpp.o

The reason pixi works is that it is creating a shell environment with python, cmake, and clang 18 installed from conda forge.

I forgot to mention, you need to do this on a clean clone of the repo. If you have previously run the scripts you will have cmake files in the build directory and files in other places that still reference things outside the environment pixi is creating with the specific versions needed to make this work.

Once you're inside the pixi shell, you can use "which clang++" and/or "clang --version" to ensure you have the right version and that it isn't the one from Xcode. You should also see a bunch of environment variables set like CC, configuring the compilers with the right isysroot, linker paths, etc.

Got it - I did check, and indeed the pixi-based clang is used:

clang --version
clang version 18.1.8
Target: arm64-apple-darwin24.4.0
Thread model: posix

Also, I did clone a fresh BitNet repository, and even with clang 18 from pixi, it still gets stuck at [ 6%] Building CXX object 3rdparty/llama.cpp/ggml/src/CMakeFiles/ggml.dir/__/__/__/__/src/ggml-bitnet-lut.cpp.o

UPDATE: after I removed the CC/CXX overrides from my .zshrc, I was able to compile from within the pixi shell. Thanks @toffaletti for all your help.

salauioan avatar Apr 22 '25 14:04 salauioan

It seems like using pixi to create a local environment may be the way to run BitNet on macOS for a while... I did the whole process with a freshly cloned BitNet project. The model I tried (Llama3-8B), along with the official one, was hallucinating, and I see a limit on the output length; that's another problem... :) We could close this issue at this point, but I will leave it open a little longer in case anyone wants to discuss the compilation issue.

ryohajika avatar Apr 23 '25 02:04 ryohajika

Why not just apply my patch and be done with it? The code is clearly broken as is, there is no point unrolling thousands of loops.

jacobgorm avatar Apr 23 '25 09:04 jacobgorm

Yeah, I agree, @jacobgorm has found a better solution; the loop unrolling seems excessive. I don't have enough experience with this to know whether clang should support this depth of unrolling, but it definitely shouldn't hang and churn on a single file for hours, so it probably warrants a bug report to llvm as well.

toffaletti avatar Apr 23 '25 20:04 toffaletti

After applying the patch, I managed to compile BitNet. But somehow the Metal shader failed to compile. After disabling Metal support (by adding "-DGGML_METAL=OFF" to COMPILER_EXTRA_ARGS), I managed to run the model on my M3 Pro Mac (running macOS 15.4.1).

gamperl avatar Apr 28 '25 20:04 gamperl

After applying the patch, I managed to compile BitNet. But somehow the Metal shader failed to compile. After disabling Metal support (by adding "-DGGML_METAL=OFF" to COMPILER_EXTRA_ARGS), I managed to run the model on my M3 Pro Mac (running macOS 15.4.1).

Do you have Xcode installed? You might need Xcode to compile the metal shaders.

toffaletti avatar Apr 29 '25 22:04 toffaletti