error "Item size 2 for PEP 3118 buffer format string B does not match the dtype B item size 1"
Hi exoers:
When I use a Q4 or Q8 quantized model, there are no issues. However, when I switch to a BF16 model, the error above occurs. Has anyone encountered this and knows how to solve it?
Best regards
I got the same problem.
I think the problem is that numpy doesn't support bf16, and in sharded_inference_engine.py the program tries to convert MLX tensors to numpy arrays, hence the error.
I updated the code to convert to float32 if the tensor type is bf16.
I don't know if it works, since I don't have a Mac, so I'd be happy if someone could try it; then I can make a PR.
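For reference, here's a minimal sketch of the kind of conversion I mean; the `to_numpy` helper name is just for illustration, the real change would go wherever sharded_inference_engine.py casts MLX arrays to numpy:

```python
import mlx.core as mx
import numpy as np

def to_numpy(x: mx.array) -> np.ndarray:
    # numpy has no bfloat16 dtype, so np.array(x) fails on bf16 tensors
    # with the PEP 3118 buffer format mismatch above.
    # Upcast bf16 to float32 first; other dtypes convert directly.
    if x.dtype == mx.bfloat16:
        x = x.astype(mx.float32)
    return np.array(x)
```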
I tried DeepSeek V3 0324 (8-bit) and it works with your version, which it didn't before. But it's quite slow.
@lordoliver can you check if it slows down non-bf16 models too, or if it breaks the code in the worst case?
I'm not deep enough into it to say; maybe that's normal speed. I used the 8-bit version and it only got 2 tokens per second on 3x Mac Studio M3 Ultra 512GB. The 4-bit version of DeepSeek R1 did 10 tokens/s, which is also not really fast. It's more or less equal to the version before your changes.