Sample command with mistral-7b failed
System Info
8× NVIDIA A100 GPUs, Linux OS
❯ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
git clone https://github.com/predibase/lorax.git
cd lorax
git checkout tags/v0.8.1 -b h/released
docker pull ghcr.io/predibase/lorax:sha256-d997075349d9c35cc9a23750acc8d25ee5d5131a4b945565b349ce8724f9ede5
model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data
~/lorax h/released ❯ sudo docker run --gpus=all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/predibase/lorax:latest --model-id $model
2024-03-07T22:06:09.111266Z INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.0, hostname: "9c3939ffb852", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-03-07T22:06:09.111376Z INFO download: lorax_launcher: Starting download process.
2024-03-07T22:06:14.313172Z INFO lorax_launcher: cli.py:109 Files are already present on the host. Skipping download.
2024-03-07T22:06:15.915642Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-03-07T22:06:15.915916Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-03-07T22:06:25.922460Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-03-07T22:06:33.331594Z INFO lorax_launcher: server.py:291 Server started at unix:///tmp/lorax-server-0
2024-03-07T22:06:33.426982Z INFO shard-manager: lorax_launcher: Shard ready in 17.51044382s rank=0
2024-03-07T22:06:33.525883Z INFO lorax_launcher: Starting Webserver
2024-03-07T22:06:33.539933Z INFO lorax_router: router/src/main.rs:202: Loading tokenizer mistralai/Mistral-7B-Instruct-v0.1
2024-03-07T22:06:33.539963Z INFO lorax_router: router/src/main.rs:222: Using the Hugging Face API
2024-03-07T22:06:33.539985Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-03-07T22:06:33.634015Z WARN lorax_router: router/src/main.rs:443: `--revision` is not set
2024-03-07T22:06:33.634030Z WARN lorax_router: router/src/main.rs:444: We strongly advise to set it to a known supported commit.
2024-03-07T22:06:33.793267Z INFO lorax_router: router/src/main.rs:465: Serving revision 73068f3702d050a2fd5aa2ca1e612e5036429398 of model mistralai/Mistral-7B-Instruct-v0.1
2024-03-07T22:06:33.803598Z INFO lorax_router: router/src/main.rs:296: Warming up model
2024-03-07T22:06:51.593450Z INFO lorax_launcher: flash_causal_lm.py:781 Memory remaining for kv cache: 64131.25 MB
2024-03-07T22:06:51.761476Z INFO lorax_router: router/src/main.rs:335: Setting max batch total tokens to 521232
2024-03-07T22:06:51.761532Z INFO lorax_router: router/src/main.rs:336: Connected
2024-03-07T22:06:51.761541Z WARN lorax_router: router/src/main.rs:341: Invalid hostname, defaulting to 0.0.0.0
2024-03-07T22:06:51.775118Z INFO lorax_router::server: router/src/server.rs:974: CORS: origin: Const("*"), methods: Const(Some("GET,POST")), headers: Const(Some("content-type")), expose-headers: Const(None) credentials: No
2024-03-07T22:06:51.775135Z INFO lorax_router::server: router/src/server.rs:986: CORS: CorsLayer { allow_credentials: No, allow_headers: Const(Some("content-type")), allow_methods: Const(Some("GET,POST")), allow_origin: Const("*"), allow_private_network: No, expose_headers: Const(None), max_age: Exact(None), vary: Vary(["origin", "access-control-request-method", "access-control-request-headers"]) }
thread 'tokio-runtime-worker' panicked at /usr/src/router/src/server.rs:794:26:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-03-07T22:06:51.936963Z ERROR lorax_launcher: Webserver Crashed
2024-03-07T22:06:51.936994Z INFO lorax_launcher: Shutting down shards
2024-03-07T22:06:52.055462Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed
Expected behavior
The server starts successfully.
Hey @hayleyhu, thanks for reporting this. This is a surprising error. Could you try running the same command, but including the environment variable RUST_BACKTRACE=1 and sharing the full log output?
Example:
docker run -e RUST_BACKTRACE=1 ...
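Concretely, combining that with the original reproduction command from above:
sudo docker run -e RUST_BACKTRACE=1 --gpus=all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/predibase/lorax:latest --model-id $model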
Hello @tgaddair, I encountered the same problem when testing the image "ghcr.io/predibase/lorax:latest". Here are the logs:
docker run --gpus '"device=7"' -e RUST_BACKTRACE=1 --shm-size 1g -p 8081:80 -v /model_dir:/data ghcr.io/predibase/lorax:latest --model-id /data/Qwen-14B-Chat --trust-remote-code
2024-03-11T02:31:06.117503Z INFO lorax_launcher: Args { model_id: "/data/Qwen-14B-Chat", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.0, hostname: "3ef400c8e367", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-03-11T02:31:06.117556Z WARN lorax_launcher: `trust_remote_code` is set. Trusting that model `/data/Qwen-14B-Chat` do not contain malicious code.
2024-03-11T02:31:06.117744Z INFO download: lorax_launcher: Starting download process.
2024-03-11T02:31:09.676052Z INFO lorax_launcher: cli.py:109 Files are already present on the host. Skipping download.
2024-03-11T02:31:10.721726Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-03-11T02:31:10.722129Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-03-11T02:31:20.730706Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-03-11T02:31:25.287915Z INFO lorax_launcher: server.py:291 Server started at unix:///tmp/lorax-server-0
2024-03-11T02:31:25.334414Z INFO shard-manager: lorax_launcher: Shard ready in 14.611113274s rank=0
2024-03-11T02:31:25.432031Z INFO lorax_launcher: Starting Webserver
2024-03-11T02:31:25.464515Z INFO lorax_router: router/src/main.rs:202: Loading tokenizer /data/Qwen-14B-Chat
2024-03-11T02:31:25.464578Z INFO lorax_router: router/src/main.rs:206: Using local tokenizer: /data/Qwen-14B-Chat
2024-03-11T02:31:25.464601Z WARN lorax_router: router/src/main.rs:251: Could not find a fast tokenizer implementation for /data/Qwen-14B-Chat
2024-03-11T02:31:25.464605Z WARN lorax_router: router/src/main.rs:252: Rust input length validation and truncation is disabled
2024-03-11T02:31:25.464609Z WARN lorax_router: router/src/main.rs:277: no pipeline tag found for model /data/Qwen-14B-Chat
2024-03-11T02:31:25.485387Z INFO lorax_router: router/src/main.rs:296: Warming up model
2024-03-11T02:31:57.331056Z INFO lorax_launcher: flash_causal_lm.py:781 Memory remaining for kv cache: 3082.375 MB
2024-03-11T02:31:57.572087Z INFO lorax_router: router/src/main.rs:335: Setting max batch total tokens to 12128
2024-03-11T02:31:57.572120Z INFO lorax_router: router/src/main.rs:336: Connected
2024-03-11T02:31:57.572134Z WARN lorax_router: router/src/main.rs:341: Invalid hostname, defaulting to 0.0.0.0
2024-03-11T02:31:57.573058Z INFO lorax_router::server: router/src/server.rs:974: CORS: origin: Const("*"), methods: Const(Some("GET,POST")), headers: Const(Some("content-type")), expose-headers: Const(None) credentials: No
2024-03-11T02:31:57.573079Z INFO lorax_router::server: router/src/server.rs:986: CORS: CorsLayer { allow_credentials: No, allow_headers: Const(Some("content-type")), allow_methods: Const(Some("GET,POST")), allow_origin: Const("*"), allow_private_network: No, expose_headers: Const(None), max_age: Exact(None), vary: Vary(["origin", "access-control-request-method", "access-control-request-headers"]) }
thread 'tokio-runtime-worker' panicked at /usr/src/router/src/server.rs:794:26:
called `Option::unwrap()` on a `None` value
stack backtrace:
0: rust_begin_unwind
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
1: core::panicking::panic_fmt
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
2: core::panicking::panic
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:127:5
3: core::option::Option<T>::unwrap
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/option.rs:935:21
4: lorax_router::server::request_logger::{{closure}}
at ./router/src/server.rs:794:22
5: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:328:17
6: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/loom/std/unsafe_cell.rs:16:9
7: tokio::runtime::task::core::Core<T,S>::poll
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:317:30
8: std::panicking::try::do_call
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:504:40
9: std::panicking::try
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:468:19
10: std::panic::catch_unwind
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panic.rs:142:14
11: tokio::runtime::task::harness::poll_future
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:473:18
12: tokio::runtime::task::harness::Harness<T,S>::poll_inner
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:208:27
13: tokio::runtime::task::harness::Harness<T,S>::poll
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:153:15
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-03-11T02:31:57.855398Z ERROR lorax_launcher: Webserver Crashed
2024-03-11T02:31:57.855429Z INFO lorax_launcher: Shutting down shards
2024-03-11T02:31:57.931701Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed
Hey @Nipi64310, thanks for providing this additional context. Unfortunately, it looks like the offending call to Option::unwrap() is still being hidden somehow. Can you try running docker pull ghcr.io/predibase/lorax:latest to ensure you're running the latest image and set RUST_BACKTRACE=full to get the full stack trace? Thanks.
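For example, reusing your command from above:
docker pull ghcr.io/predibase/lorax:latest
docker run --gpus '"device=7"' -e RUST_BACKTRACE=full --shm-size 1g -p 8081:80 -v /model_dir:/data \
ghcr.io/predibase/lorax:latest --model-id /data/Qwen-14B-Chat --trust-remote-code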
Hi @tgaddair, thanks for getting back to me. I've updated to the latest Docker image and the server starts now.
Hello @tgaddair,
Loading Qwen-72B-Chat-Int4 fails with RuntimeError: CUDA error: an illegal memory access was encountered, while loading Qwen-14B-Chat-Int4 works correctly. Here is the error log:
docker run --gpus '"device=2,3,4,5"' -e RUST_BACKTRACE=full --shm-size 1g -p 8081:80 -v /Qwen/:/data ghcr.nju.edu.cn/predibase/lorax:latest --model-id /data/Qwen-72B-Chat-Int4 --adapter-source local --trust-remote-code --quantize gptq
2024-03-11T09:24:56.420409Z INFO lorax_launcher: Starting Webserver
2024-03-11T09:24:56.457190Z INFO lorax_router: router/src/main.rs:202: Loading tokenizer /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.459163Z INFO lorax_router: router/src/main.rs:206: Using local tokenizer: /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.459186Z WARN lorax_router: router/src/main.rs:251: Could not find a fast tokenizer implementation for /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.459265Z WARN lorax_router: router/src/main.rs:252: Rust input length validation and truncation is disabled
2024-03-11T09:24:56.459270Z WARN lorax_router: router/src/main.rs:277: no pipeline tag found for model /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.503452Z INFO lorax_router: router/src/main.rs:296: Warming up model
2024-03-11T09:24:59.348856Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 330, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 80, in Warmup
max_supported_total_tokens = self.model.warmup(batch, request.max_new_tokens)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 746, in warmup
_, batch = self.generate_token(batch, is_warmup=True)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 878, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 875, in generate_token
out = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 833, in forward
return model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 476, in forward
hidden_states = self.transformer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 433, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 358, in forward
attn_output = self.attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 227, in forward
qkv = self.c_attn(hidden_states, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 601, in forward
result = self.base_layer(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 399, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 349, in forward
out = QuantLinearFunction.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 123, in decorate_fwd
return fwd(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 244, in forward
output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 216, in matmul248
matmul_248_kernel[grid](
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 110, in run
timings = {
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
config: self._bench(*args, config=config, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 90, in _bench
return triton.testing.do_bench(
File "/opt/conda/lib/python3.10/site-packages/triton/testing.py", line 103, in do_bench
torch.cuda.synchronize()
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 801, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-03-11T09:25:05.191066Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=2048}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error: Warmup(Generation("CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"))
2024-03-11T09:25:05.227846Z ERROR lorax_launcher: Webserver Crashed
2024-03-11T09:25:05.227884Z INFO lorax_launcher: Shutting down shards
2024-03-11T09:25:05.576928Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
2024-03-11T09:25:05.599339Z INFO shard-manager: lorax_launcher: Shard terminated rank=2
2024-03-11T09:25:05.599523Z INFO shard-manager: lorax_launcher: Shard terminated rank=3
2024-03-11T09:25:05.643815Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
Error: WebserverFailed
Hey @Nipi64310, can you share the output of nvidia-smi? It looks like the warmup process is running out of memory. You may need to try reducing these values:
max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=2048
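For example, adjusting your earlier command along these lines (the values here are just a starting point; adding -e CUDA_LAUNCH_BLOCKING=1 follows the hint in the traceback and should make the failing kernel show up at the right place in the stack):
docker run --gpus '"device=2,3,4,5"' -e RUST_BACKTRACE=full -e CUDA_LAUNCH_BLOCKING=1 --shm-size 1g -p 8081:80 \
-v /Qwen/:/data ghcr.nju.edu.cn/predibase/lorax:latest --model-id /data/Qwen-72B-Chat-Int4 \
--adapter-source local --trust-remote-code --quantize gptq \
--max-input-length 512 --max-batch-prefill-tokens 1024 --max-total-tokens 1024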
@hayleyhu can you try pulling the latest image and see if that resolves the unwrap() panic?
Okay, I think I see what's happening here. The unwrap error is occurring because of PR #309, which was accidentally pushing latest images during development.
cc @magdyksaleh
Let's make sure we only push dev images with a specific tag for the branch. I'll see if there's something we can do to prevent this automatically. In the meantime, I'll see if we can retag the current latest with the last commit to main.
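Roughly, the retag would look like this (the digest is a placeholder for the last build off main, and this assumes push access to the registry):
docker pull ghcr.io/predibase/lorax@sha256:<digest-of-last-main-build>
docker tag ghcr.io/predibase/lorax@sha256:<digest-of-last-main-build> ghcr.io/predibase/lorax:latest
docker push ghcr.io/predibase/lorax:latest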
@magdyksaleh confirmed the latest image has been fixed to be tagged from main.
Hi @tgaddair, I specified --max-input-length 128 --max-batch-prefill-tokens 512 --max-batch-total-tokens 512 --max-total-tokens 512, but I'm still getting the same error log.
docker run --gpus '"device=2,3,4,5"' -e RUST_BACKTRACE=full --shm-size 1g -p 8081:80 -v /Qwen:/data ghcr.nju.edu.cn/predibase/lorax:latest --model-id /data/Qwen-72B-Chat-Int4 --adapter-source local --quantize gptq --max-input-length 128 --max-batch-prefill-tokens 512 --max-batch-total-tokens 512 --max-total-tokens 512 --trust-remote-code
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-03-12T02:40:12.730824Z ERROR warmup{max_input_length=128 max_prefill_tokens=512 max_total_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error: Warmup(Generation("CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"))
2024-03-12T02:40:12.751591Z ERROR lorax_launcher: Webserver Crashed
2024-03-12T02:40:12.751620Z INFO lorax_launcher: Shutting down shards
2024-03-12T02:40:13.041195Z INFO shard-manager: lorax_launcher: Shard terminated rank=2
2024-03-12T02:40:13.064553Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
2024-03-12T02:40:13.091416Z INFO shard-manager: lorax_launcher: Shard terminated rank=3
2024-03-12T02:40:13.138504Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Thanks, my original question was resolved!