Error deploying GPTQ models to SageMaker
System Info
I have used the following guide to deploy LoRAX to SageMaker, and I am able to do so successfully with unquantized models; OpenHermes 2.5 deployed without issue. However, when I try the GPTQ version of OpenHermes or Mixtral, I consistently get the following error:
```
File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 79, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 726, in warmup
    _, batch = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 855, in generate_token
    raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 852, in generate_token
    out = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 426, in forward
    logits = model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 979, in forward
    hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 922, in forward
    hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 849, in forward
    attn_output = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 379, in forward
    qkv = self.query_key_value(hidden_states, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 601, in forward
    result = self.base_layer(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 399, in forward
    return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 349, in forward
    out = QuantLinearFunction.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 123, in decorate_fwd
    return fwd(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 244, in forward
    output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 216, in matmul248
    matmul_248_kernel[grid](
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 110, in run
    timings = {
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
    config: self._bench(*args, config=config, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 90, in _bench
    return triton.testing.do_bench(
File "/opt/conda/lib/python3.10/site-packages/triton/testing.py", line 102, in do_bench
    fn()
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 80, in kernel_call
    self.fn.run(
File "/opt/conda/lib/python3.10/site-packages/triton/runtime/jit.py", line 550, in run
    bin.c_wrapper(
File "/opt/conda/lib/python3.10/site-packages/triton/compiler/compiler.py", line 692, in __getattribute__
    self._init_handles()
File "/opt/conda/lib/python3.10/site-packages/triton/compiler/compiler.py", line 683, in _init_handles
    mod, func, n_regs, n_spills = fn_load_binary(self.metadata["name"], self.asm[bin_path], self.shared, device)
RuntimeError: Triton Error [CUDA]: device kernel image is invalid
```
I am using the latest lorax image and am unable to figure out how to resolve this. Can someone please help me figure this out?
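For what it's worth, as far as I understand, the final "device kernel image is invalid" error usually means the Triton-compiled kernel does not match the GPU architecture or CUDA runtime. This is the kind of check I can run inside the container to gather more details; it is only a diagnostic sketch, nothing LoRAX-specific:

```python
# Diagnostic sketch: print the CUDA / Triton / GPU details the Triton GPTQ
# kernels depend on. Run inside the running LoRAX container.
import torch
import triton

print("torch:", torch.__version__)
print("torch CUDA:", torch.version.cuda)
print("triton:", triton.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    # An A10G should report compute capability (8, 6)
    print("compute capability:", torch.cuda.get_device_capability(0))
```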
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [X] My own modifications
Reproduction
- Using an A10G instance (g5.12xlarge) for deployment - working with fp16 but not with GPTQ (a rough sketch of how the endpoint is created is below)
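For reference, this is roughly how the endpoint is created with the SageMaker Python SDK. It is only a sketch: the image URI placeholder, the endpoint name, and the `HF_MODEL_ID` / `QUANTIZE` / `REVISION` environment variable names are assumptions based on how I adapted the guide, not copied verbatim from it.

```python
# Rough sketch of the SageMaker deployment (image URI and env var names are
# assumptions; adjust to match the LoRAX-on-SageMaker guide being followed).
import sagemaker
from sagemaker import Model

role = sagemaker.get_execution_role()

model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/lorax:latest",  # LoRAX image pushed to ECR
    role=role,
    env={
        "HF_MODEL_ID": "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ",
        "QUANTIZE": "gptq",  # assumption: forwarded to the launcher as --quantize gptq
        "REVISION": "gptq-4bit-32g-actorder_True",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # 4x A10G
    endpoint_name="lorax-openhermes-gptq",
)
```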
Expected behavior
Expected a successful deployment; it does not work with GPTQ.
Hi @GlacierPurpleBison, taking a look. Do you know if this is a SageMaker-specific bug, or if it occurs when initializing a vanilla LoRAX Docker container as well?
Hi @geoffreyangus, it isn't working on SageMaker notebooks using Docker directly either. It seems to get stuck waiting for the shard to be ready, and when I run the client I get a connection refused error. When I use teknium's OpenHermes directly, the client connects and the output is as expected.
```
Status: Downloaded newer image for ghcr.io/predibase/lorax:latest
2024-02-16T07:04:28.688487Z INFO lorax_launcher: Args { model_id: "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ", adapter_id: None, source: "hub", adapter_source: "hub", revision: Some("gptq-4bit-32g-actorder_True"), validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "69d6ffd6f9c6", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_headers: [], cors_allow_methods: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-02-16T07:04:28.688602Z INFO download: lorax_launcher: Starting download process.
2024-02-16T07:04:33.112129Z INFO lorax_launcher: cli.py:109 Files are already present on the host. Skipping download.
2024-02-16T07:04:33.693079Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-02-16T07:04:33.693239Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-02-16T07:04:43.702379Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:04:53.710063Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:03.717915Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:13.726304Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:23.734374Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:33.742445Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:43.750895Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:51.652610Z INFO lorax_launcher: weights.py:253 Using exllama kernels
2024-02-16T07:05:53.758610Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:06:03.766768Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:06:13.775196Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:06:16.909582Z INFO lorax_launcher: weights.py:253 Using exllama kernels
```
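For completeness, this is roughly the client call that gets the connection refused error while the shard is still starting. It is a minimal sketch assuming the lorax-client Python package and that the container's port 80 is mapped to 8080 on the notebook instance; the fp16 OpenHermes deployment responds to the same call.

```python
# Minimal client sketch (lorax-client package); endpoint URL assumes the
# container's port 80 is published as 8080 on the notebook instance.
from lorax import Client

client = Client("http://127.0.0.1:8080")
response = client.generate("What is deep learning?", max_new_tokens=64)
print(response.generated_text)
```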
Unfortunately, I am not able to verify this outside of SageMaker, since I have a Mac and hence can't test it locally for consistency.
I am using the same A10G machine for the SageMaker notebook instance, but you'll notice that I am not getting the CUDA Triton error here.
By the way, I am not getting any error when I try to deploy AWQ; I am only getting this with GPTQ.
Hey @geoffreyangus, were you able to check further on this?
Hi @GlacierPurpleBison, I apologize, I took Presidents' Day weekend off. I'll take a look at it this week, thanks!