Sample command with mistral-7b failed
System Info
8× NVIDIA A100 GPUs, Linux OS
❯ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
git clone https://github.com/predibase/lorax.git
cd lorax
git checkout tags/v0.8.1 -b h/released
docker pull ghcr.io/predibase/lorax:sha256-d997075349d9c35cc9a23750acc8d25ee5d5131a4b945565b349ce8724f9ede5
model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data
~/lorax h/released ❯ sudo docker run --gpus=all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/predibase/lorax:latest --model-id $model
2024-03-07T22:06:09.111266Z INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.0, hostname: "9c3939ffb852", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-03-07T22:06:09.111376Z INFO download: lorax_launcher: Starting download process.
2024-03-07T22:06:14.313172Z INFO lorax_launcher: cli.py:109 Files are already present on the host. Skipping download.
2024-03-07T22:06:15.915642Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-03-07T22:06:15.915916Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-03-07T22:06:25.922460Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-03-07T22:06:33.331594Z INFO lorax_launcher: server.py:291 Server started at unix:///tmp/lorax-server-0
2024-03-07T22:06:33.426982Z INFO shard-manager: lorax_launcher: Shard ready in 17.51044382s rank=0
2024-03-07T22:06:33.525883Z INFO lorax_launcher: Starting Webserver
2024-03-07T22:06:33.539933Z INFO lorax_router: router/src/main.rs:202: Loading tokenizer mistralai/Mistral-7B-Instruct-v0.1
2024-03-07T22:06:33.539963Z INFO lorax_router: router/src/main.rs:222: Using the Hugging Face API
2024-03-07T22:06:33.539985Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-03-07T22:06:33.634015Z WARN lorax_router: router/src/main.rs:443: `--revision` is not set
2024-03-07T22:06:33.634030Z WARN lorax_router: router/src/main.rs:444: We strongly advise to set it to a known supported commit.
2024-03-07T22:06:33.793267Z INFO lorax_router: router/src/main.rs:465: Serving revision 73068f3702d050a2fd5aa2ca1e612e5036429398 of model mistralai/Mistral-7B-Instruct-v0.1
2024-03-07T22:06:33.803598Z INFO lorax_router: router/src/main.rs:296: Warming up model
2024-03-07T22:06:51.593450Z INFO lorax_launcher: flash_causal_lm.py:781 Memory remaining for kv cache: 64131.25 MB
2024-03-07T22:06:51.761476Z INFO lorax_router: router/src/main.rs:335: Setting max batch total tokens to 521232
2024-03-07T22:06:51.761532Z INFO lorax_router: router/src/main.rs:336: Connected
2024-03-07T22:06:51.761541Z WARN lorax_router: router/src/main.rs:341: Invalid hostname, defaulting to 0.0.0.0
2024-03-07T22:06:51.775118Z INFO lorax_router::server: router/src/server.rs:974: CORS: origin: Const("*"), methods: Const(Some("GET,POST")), headers: Const(Some("content-type")), expose-headers: Const(None) credentials: No
2024-03-07T22:06:51.775135Z INFO lorax_router::server: router/src/server.rs:986: CORS: CorsLayer { allow_credentials: No, allow_headers: Const(Some("content-type")), allow_methods: Const(Some("GET,POST")), allow_origin: Const("*"), allow_private_network: No, expose_headers: Const(None), max_age: Exact(None), vary: Vary(["origin", "access-control-request-method", "access-control-request-headers"]) }
thread 'tokio-runtime-worker' panicked at /usr/src/router/src/server.rs:794:26:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-03-07T22:06:51.936963Z ERROR lorax_launcher: Webserver Crashed
2024-03-07T22:06:51.936994Z INFO lorax_launcher: Shutting down shards
2024-03-07T22:06:52.055462Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed
Expected behavior
The server starts successfully.
Hey @hayleyhu, thanks for reporting this. This is a surprising error. Could you try running the same command, but including the environment variable RUST_BACKTRACE=1 and sharing the full log output?
Example:
docker run -e RUST_BACKTRACE=1 ...
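Concretely, combining that with the original reproduction command from above:
sudo docker run -e RUST_BACKTRACE=1 --gpus=all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/predibase/lorax:latest --model-id $model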
Hello @tgaddair, I encountered the same problem when testing the image "ghcr.io/predibase/lorax:latest". Here are the logs:
docker run --gpus '"device=7"' -e RUST_BACKTRACE=1 --shm-size 1g -p 8081:80 -v /model_dir:/data ghcr.io/predibase/lorax:latest --model-id /data/Qwen-14B-Chat --trust-remote-code
2024-03-11T02:31:06.117503Z INFO lorax_launcher: Args { model_id: "/data/Qwen-14B-Chat", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.0, hostname: "3ef400c8e367", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-03-11T02:31:06.117556Z WARN lorax_launcher: `trust_remote_code` is set. Trusting that model `/data/Qwen-14B-Chat` do not contain malicious code.
2024-03-11T02:31:06.117744Z INFO download: lorax_launcher: Starting download process.
2024-03-11T02:31:09.676052Z INFO lorax_launcher: cli.py:109 Files are already present on the host. Skipping download.
2024-03-11T02:31:10.721726Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-03-11T02:31:10.722129Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-03-11T02:31:20.730706Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-03-11T02:31:25.287915Z INFO lorax_launcher: server.py:291 Server started at unix:///tmp/lorax-server-0
2024-03-11T02:31:25.334414Z INFO shard-manager: lorax_launcher: Shard ready in 14.611113274s rank=0
2024-03-11T02:31:25.432031Z INFO lorax_launcher: Starting Webserver
2024-03-11T02:31:25.464515Z INFO lorax_router: router/src/main.rs:202: Loading tokenizer /data/Qwen-14B-Chat
2024-03-11T02:31:25.464578Z INFO lorax_router: router/src/main.rs:206: Using local tokenizer: /data/Qwen-14B-Chat
2024-03-11T02:31:25.464601Z WARN lorax_router: router/src/main.rs:251: Could not find a fast tokenizer implementation for /data/Qwen-14B-Chat
2024-03-11T02:31:25.464605Z WARN lorax_router: router/src/main.rs:252: Rust input length validation and truncation is disabled
2024-03-11T02:31:25.464609Z WARN lorax_router: router/src/main.rs:277: no pipeline tag found for model /data/Qwen-14B-Chat
2024-03-11T02:31:25.485387Z INFO lorax_router: router/src/main.rs:296: Warming up model
2024-03-11T02:31:57.331056Z INFO lorax_launcher: flash_causal_lm.py:781 Memory remaining for kv cache: 3082.375 MB
2024-03-11T02:31:57.572087Z INFO lorax_router: router/src/main.rs:335: Setting max batch total tokens to 12128
2024-03-11T02:31:57.572120Z INFO lorax_router: router/src/main.rs:336: Connected
2024-03-11T02:31:57.572134Z WARN lorax_router: router/src/main.rs:341: Invalid hostname, defaulting to 0.0.0.0
2024-03-11T02:31:57.573058Z INFO lorax_router::server: router/src/server.rs:974: CORS: origin: Const("*"), methods: Const(Some("GET,POST")), headers: Const(Some("content-type")), expose-headers: Const(None) credentials: No
2024-03-11T02:31:57.573079Z INFO lorax_router::server: router/src/server.rs:986: CORS: CorsLayer { allow_credentials: No, allow_headers: Const(Some("content-type")), allow_methods: Const(Some("GET,POST")), allow_origin: Const("*"), allow_private_network: No, expose_headers: Const(None), max_age: Exact(None), vary: Vary(["origin", "access-control-request-method", "access-control-request-headers"]) }
thread 'tokio-runtime-worker' panicked at /usr/src/router/src/server.rs:794:26:
called `Option::unwrap()` on a `None` value
stack backtrace:
0: rust_begin_unwind
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
1: core::panicking::panic_fmt
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
2: core::panicking::panic
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:127:5
3: core::option::Option<T>::unwrap
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/option.rs:935:21
4: lorax_router::server::request_logger::{{closure}}
at ./router/src/server.rs:794:22
5: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:328:17
6: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/loom/std/unsafe_cell.rs:16:9
7: tokio::runtime::task::core::Core<T,S>::poll
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:317:30
8: std::panicking::try::do_call
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:504:40
9: std::panicking::try
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:468:19
10: std::panic::catch_unwind
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panic.rs:142:14
11: tokio::runtime::task::harness::poll_future
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:473:18
12: tokio::runtime::task::harness::Harness<T,S>::poll_inner
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:208:27
13: tokio::runtime::task::harness::Harness<T,S>::poll
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:153:15
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-03-11T02:31:57.855398Z ERROR lorax_launcher: Webserver Crashed
2024-03-11T02:31:57.855429Z INFO lorax_launcher: Shutting down shards
2024-03-11T02:31:57.931701Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed
Hey @Nipi64310, thanks for providing this additional context. Unfortunately, it looks like the offending call to Option::unwrap() is still being hidden somehow. Can you try running docker pull ghcr.io/predibase/lorax:latest to ensure you're running the latest image and set RUST_BACKTRACE=full to get the full stack trace? Thanks.
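For example, reusing your command from above:
docker pull ghcr.io/predibase/lorax:latest
docker run --gpus '"device=7"' -e RUST_BACKTRACE=full --shm-size 1g -p 8081:80 -v /model_dir:/data \
ghcr.io/predibase/lorax:latest --model-id /data/Qwen-14B-Chat --trust-remote-code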
Hi @tgaddair, thanks for getting back to me. I've updated to the latest Docker image and the server starts now.
Hello @tgaddair,
Loading Qwen-72B-Chat-Int4 fails with RuntimeError: CUDA error: an illegal memory access was encountered, while loading Qwen-14B-Chat-Int4 works correctly. Here is the error log:
docker run --gpus '"device=2,3,4,5"' -e RUST_BACKTRACE=full --shm-size 1g -p 8081:80 -v /Qwen/:/data ghcr.nju.edu.cn/predibase/lorax:latest --model-id /data/Qwen-72B-Chat-Int4 --adapter-source local --trust-remote-code --quantize gptq
2024-03-11T09:24:56.420409Z INFO lorax_launcher: Starting Webserver
2024-03-11T09:24:56.457190Z INFO lorax_router: router/src/main.rs:202: Loading tokenizer /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.459163Z INFO lorax_router: router/src/main.rs:206: Using local tokenizer: /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.459186Z WARN lorax_router: router/src/main.rs:251: Could not find a fast tokenizer implementation for /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.459265Z WARN lorax_router: router/src/main.rs:252: Rust input length validation and truncation is disabled
2024-03-11T09:24:56.459270Z WARN lorax_router: router/src/main.rs:277: no pipeline tag found for model /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.503452Z INFO lorax_router: router/src/main.rs:296: Warming up model
2024-03-11T09:24:59.348856Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 330, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 80, in Warmup
max_supported_total_tokens = self.model.warmup(batch, request.max_new_tokens)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 746, in warmup
_, batch = self.generate_token(batch, is_warmup=True)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 878, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 875, in generate_token
out = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 833, in forward
return model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 476, in forward
hidden_states = self.transformer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 433, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 358, in forward
attn_output = self.attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 227, in forward
qkv = self.c_attn(hidden_states, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 601, in forward
result = self.base_layer(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 399, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 349, in forward
out = QuantLinearFunction.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 123, in decorate_fwd
return fwd(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 244, in forward
output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 216, in matmul248
matmul_248_kernel[grid](
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 110, in run
timings = {
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
config: self._bench(*args, config=config, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 90, in _bench
return triton.testing.do_bench(
File "/opt/conda/lib/python3.10/site-packages/triton/testing.py", line 103, in do_bench
torch.cuda.synchronize()
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 801, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-03-11T09:25:05.191066Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=2048}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error: Warmup(Generation("CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"))
2024-03-11T09:25:05.227846Z ERROR lorax_launcher: Webserver Crashed
2024-03-11T09:25:05.227884Z INFO lorax_launcher: Shutting down shards
2024-03-11T09:25:05.576928Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
2024-03-11T09:25:05.599339Z INFO shard-manager: lorax_launcher: Shard terminated rank=2
2024-03-11T09:25:05.599523Z INFO shard-manager: lorax_launcher: Shard terminated rank=3
2024-03-11T09:25:05.643815Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
Error: WebserverFailed
Hey @Nipi64310, can you share the output of nvidia-smi? It looks like the warmup process is running out of memory. You may need to try reducing these values:
max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=2048
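For example, adjusting your earlier command along these lines (the values here are just a starting point; adding -e CUDA_LAUNCH_BLOCKING=1 follows the hint in the traceback and should make the failing kernel show up at the right place in the stack):
docker run --gpus '"device=2,3,4,5"' -e RUST_BACKTRACE=full -e CUDA_LAUNCH_BLOCKING=1 --shm-size 1g -p 8081:80 \
-v /Qwen/:/data ghcr.nju.edu.cn/predibase/lorax:latest --model-id /data/Qwen-72B-Chat-Int4 \
--adapter-source local --trust-remote-code --quantize gptq \
--max-input-length 512 --max-batch-prefill-tokens 1024 --max-total-tokens 1024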
@hayleyhu can you try pulling the latest image and see if that resolves the unwrap() panic?
Okay, I think I see what's happening here. The unwrap error is occurring because of PR #309, which was accidentally pushing latest images during development.
cc @magdyksaleh
Let's make sure we only push dev images with a specific tag for the branch. I'll see if there's something we can do to prevent this automatically. In the meantime, I'll see if we can retag the current latest with the last commit to main.
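Roughly, the retag would look like this (the digest is a placeholder for the last build off main, and this assumes push access to the registry):
docker pull ghcr.io/predibase/lorax@sha256:<digest-of-last-main-build>
docker tag ghcr.io/predibase/lorax@sha256:<digest-of-last-main-build> ghcr.io/predibase/lorax:latest
docker push ghcr.io/predibase/lorax:latest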
@magdyksaleh confirmed the latest image has been fixed to be tagged from main.
Hi @tgaddair, I specified --max-input-length 128 --max-batch-prefill-tokens 512 --max-batch-total-tokens 512 --max-total-tokens 512, but I'm still getting the same error log.
docker run --gpus '"device=2,3,4,5"' -e RUST_BACKTRACE=full --shm-size 1g -p 8081:80 -v /Qwen:/data ghcr.nju.edu.cn/predibase/lorax:latest --model-id /data/Qwen-72B-Chat-Int4 --adapter-source local --quantize gptq --max-input-length 128 --max-batch-prefill-tokens 512 --max-batch-total-tokens 512 --max-total-tokens 512 --trust-remote-code
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-03-12T02:40:12.730824Z ERROR warmup{max_input_length=128 max_prefill_tokens=512 max_total_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error: Warmup(Generation("CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"))
2024-03-12T02:40:12.751591Z ERROR lorax_launcher: Webserver Crashed
2024-03-12T02:40:12.751620Z INFO lorax_launcher: Shutting down shards
2024-03-12T02:40:13.041195Z INFO shard-manager: lorax_launcher: Shard terminated rank=2
2024-03-12T02:40:13.064553Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
2024-03-12T02:40:13.091416Z INFO shard-manager: lorax_launcher: Shard terminated rank=3
2024-03-12T02:40:13.138504Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Thanks, my original question was resolved!