error "Item size 2 for PEP 3118 buffer format string B does not match the dtype B item size 1"
Hi exoers:
When I use a Q4 or Q8 quantized model, there are no issues. However, when I switch to a BF16 model, the error above occurs. Has anyone encountered this and knows how to solve it?
Best regards
I got the same problem.
I think the problem is that numpy doesn't support bf16, and in sharded_inference_engine.py the program tries to convert MLX tensors to numpy arrays, hence the error.
I updated the code to convert to float32 if the tensor type is bf16.
I don't know if it works, since I don't have a Mac, so I'd be happy if someone could try it; then I can make a PR.
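For reference, here's a minimal sketch of the kind of conversion I mean; the `to_numpy` helper name is just for illustration, the real change would go wherever sharded_inference_engine.py casts MLX arrays to numpy:

```python
import mlx.core as mx
import numpy as np

def to_numpy(x: mx.array) -> np.ndarray:
    # numpy has no bfloat16 dtype, so np.array(x) fails on bf16 tensors
    # with the PEP 3118 buffer format mismatch above.
    # Upcast bf16 to float32 first; other dtypes convert directly.
    if x.dtype == mx.bfloat16:
        x = x.astype(mx.float32)
    return np.array(x)
```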
I tried DeepSeek V3 0324 (8-bit) and it works with your version, which it didn't before. But it's quite slow.
@lordoliver can you check if it slows down non-bf16 models too, or if it breaks the code in the worst case?
I'm not deep enough into it to say; maybe that's normal speed. I used the 8-bit version and it only got 2 tokens per second on 3x Mac Studio M3 Ultra 512GB. The 4-bit version of DeepSeek R1 did 10 tokens/s, which is also not really fast. It's more or less equal to the version before your changes.