                        Quantized BNB-4bit models are not working.
System Info
Testing on 2x NVIDIA RTX 4070 Ti Super with:
      - MODEL_ID=unsloth/Qwen2.5-Coder-32B-bnb-4bit
      - MODEL_ID=unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit
text-generation-inference-1  | [rank1]: │ /usr/src/server/text_generation_server/utils/weights.py:275 in get_sharded   │
text-generation-inference-1  | [rank1]: │                                                                              │
text-generation-inference-1  | [rank1]: │   272 │   │   world_size = self.process_group.size()                         │
text-generation-inference-1  | [rank1]: │   273 │   │   size = slice_.get_shape()[dim]                                 │
text-generation-inference-1  | [rank1]: │   274 │   │   assert (                                                       │
text-generation-inference-1  | [rank1]: │ ❱ 275 │   │   │   size % world_size == 0                                     │
text-generation-inference-1  | [rank1]: │   276 │   │   ), f"The choosen size {size} is not compatible with sharding o │
text-generation-inference-1  | [rank1]: │   277 │   │   return self.get_partial_sharded(                               │
text-generation-inference-1  | [rank1]: │   278 │   │   │   tensor_name, dim, to_device=to_device, to_dtype=to_dtype   │
text-generation-inference-1  | [rank1]: │                                                                              │
text-generation-inference-1  | [rank1]: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
text-generation-inference-1  | [rank1]: │ │         dim = 1                                                          │ │
text-generation-inference-1  | [rank1]: │ │           f = <builtins.safe_open object at 0x77297422f7b0>              │ │
text-generation-inference-1  | [rank1]: │ │    filename = '/data/hub/models--unsloth--Qwen2.5-Coder-32B-bnb-4bit/sn… │ │
text-generation-inference-1  | [rank1]: │ │        self = <text_generation_server.utils.weights.Weights object at    │ │
text-generation-inference-1  | [rank1]: │ │               0x772972e7a3d0>                                            │ │
text-generation-inference-1  | [rank1]: │ │        size = 1                                                          │ │
text-generation-inference-1  | [rank1]: │ │      slice_ = <builtins.PySafeSlice object at 0x772973da8f80>            │ │
text-generation-inference-1  | [rank1]: │ │ tensor_name = 'model.layers.0.self_attn.o_proj.weight'                   │ │
text-generation-inference-1  | [rank1]: │ │   to_device = True                                                       │ │
text-generation-inference-1  | [rank1]: │ │    to_dtype = True                                                       │ │
text-generation-inference-1  | [rank1]: │ │  world_size = 2                                                          │ │
text-generation-inference-1  | [rank1]: │ ╰──────────────────────────────────────────────────────────────────────────╯ │
text-generation-inference-1  | [rank1]: ╰──────────────────────────────────────────────────────────────────────────────╯
text-generation-inference-1  | [rank1]: AssertionError: The choosen size 1 is not compatible with sharding on 2 shards rank=1
text-generation-inference-1  | 2025-02-10T12:36:10.058627Z ERROR text_generation_launcher: Shard 1 failed to start
text-generation-inference-1  | 2025-02-10T12:36:10.058637Z  INFO text_generation_launcher: Shutting down shards
text-generation-inference-1  | 2025-02-10T12:36:10.065243Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
text-generation-inference-1  | 2025-02-10T12:36:10.065344Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
text-generation-inference-1  | 2025-02-10T12:36:10.165431Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
text-generation-inference-1  | Error: ShardCannotStart
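For context on the assertion: bitsandbytes 4-bit checkpoints serialize each quantized weight as a flat packed uint8 tensor of shape roughly [numel / 2, 1] (the quantization state lives in separate tensors), not in the layer's original 2D layout. When TGI shards model.layers.0.self_attn.o_proj.weight along dim=1 across the two GPUs, it therefore sees size = 1, and 1 % 2 != 0 trips the check. A minimal inspection sketch to confirm the stored shape (the shard filename is an assumption; check the repo's model.safetensors.index.json for the file that actually contains layer 0):

    # Sketch: inspect the packed shape of a prequantized bnb-4bit weight.
    from huggingface_hub import hf_hub_download
    from safetensors import safe_open

    path = hf_hub_download(
        "unsloth/Qwen2.5-Coder-32B-bnb-4bit",
        "model-00001-of-00005.safetensors",  # assumed filename; see index.json
    )
    with safe_open(path, framework="pt") as f:
        sl = f.get_slice("model.layers.0.self_attn.o_proj.weight")
        # Expect something like [N, 1] (packed uint8), not [hidden, hidden];
        # dim=1 has size 1, so it cannot be split across 2 shards.
        print(sl.get_shape())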
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
services:
  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:3.1.0
    environment:
      - HF_TOKEN=<redacted>
      # - MODEL_ID=Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
      # - MODEL_ID=mistralai/Mistral-Small-24B-Instruct-2501
      # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-AWQ
      # - MODEL_ID=avoroshilov/DeepSeek-R1-Distill-Qwen-32B-GPTQ_4bit-128g
      # - MODEL_ID=Valdemardi/DeepSeek-R1-Distill-Qwen-32B-AWQ
      # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
      - MODEL_ID=unsloth/Qwen2.5-Coder-32B-bnb-4bit
      # - MODEL_ID=unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit
      # - SHARDED=true
      # - NUM_SHARD=2
      # - QUANTIZE=bitsandbytes
    ports:
      - "0.0.0.0:8099:80"
    restart: "unless-stopped"
    # command: "--quantize bitsandbytes-nf4 --max-input-tokens 30000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    shm_size: '90g'
    volumes:
      - ~/.hf-docker-data:/data
    networks:
      - llmhost
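Note: even with the sharding variables commented out, the launcher appears to shard across every GPU it can see (both device IDs are exposed above), which would explain why the traceback shows world_size = 2 without any explicit NUM_SHARD setting.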
Expected behavior
Unquantized checkpoints work fine with "--quantize bitsandbytes-nf4 --max-input-tokens 30000" (on-the-fly NF4 quantization); only the prequantized bnb-4bit repositories fail.
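A plausible workaround, offered as an assumption rather than something verified here: keep MODEL_ID pointed at the unquantized base checkpoint (e.g. Qwen/Qwen2.5-Coder-32B) together with "--quantize bitsandbytes-nf4", so TGI quantizes after sharding instead of trying to shard the packed bnb tensors; alternatively, a GPTQ or AWQ checkpoint, formats TGI appears to support for sharded loading, may sidestep the issue.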