lorax
Fails hard on CUDA error
System Info
We are using the streaming v1 chat completions API. After some number of requests, or after a request with a large enough context, the LoRAX server stops responding, and all subsequent requests fail as well.
infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
We are running it in Docker with 1 GPU (A100 PCIe) on runpod.io:
lorax-launcher --model-id microsoft/phi-2 --adapter-source s3 --compile --dtype bfloat16 --port 3000 --revision ef382358ec9e382308935a992d908de099b64c23 --max-input-length 2000 --max-total-tokens 2048 --env
2024-06-22T01:38:49.630259Z INFO lorax_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.74.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Sat Jun 22 01:38:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:E1:00.0 Off | 0 |
| N/A 34C P0 61W / 300W | 71234MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
2024-06-22T01:38:49.630346Z INFO lorax_launcher: Args { model_id: "microsoft/phi-2", adapter_id: None, source: "hub", adapter_source: "s3", revision: Some("ef382358ec9e382308935a992d908de099b64c23"), validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: true, dtype: Some(BFloat16), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 2000, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "960a5e26c0d7", port: 3000, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true, download_only: false }
Full request log:
2024-06-22T01:12:03.526879786Z 2024-06-22T01:12:03.526727Z ERROR HTTP request{
otel.name=POST
/v1/chat/completions
http.flavor=1.1
http.method=POST
http.route=/v1/chat/completions
http.scheme=HTTP
http.target=/v1/chat/completions
http.user_agent=Ktor
client
otel.kind=server
trace_id=e68f52322fc88977fb39f91db1970199
http.status_code=200 otel.status_code="OK"
}:chat_completions_v1{default_return_full_text=Extension(false) info=Extension(Info {
model_id: "microsoft/phi-2",
model_sha: Some("ef382358ec9e382308935a992d908de099b64c23"),
model_dtype: "torch.bfloat16",
model_device_type: "cuda",
model_pipeline_tag: Some("text-generation"),
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_input_length: 2000,
max_total_tokens: 2048,
waiting_served_ratio: 1.2,
max_batch_total_tokens: 188144,
max_waiting_tokens: 20,
validation_workers: 2,
version: "0.1.0",
sha: None,
docker_label: None,
request_logger_url: None }
) request_logger_sender=Extension(Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x55bd96f750e0,
tail_position: 0 },
semaphore: Semaphore { semaphore: Semaphore { permits: 32 },
bound: 32 }, rx_waker: AtomicWaker, tx_count: 1, rx_fields: "..." } } })
req_headers={
"user-agent": "Ktor client",
"content-length": "5341",
"accept": "text/event-stream,application/json",
"accept-charset": "UTF-8",
"accept-encoding": "gzip,
br",
"authorization": "Bearer",
"cache-control": "no-cache",
"cdn-loop": "cloudflare",
"cf-ipcountry": "US",
"cf-ray": "89785e2c4ce3ce40-SJC",
"cf-visitor": "{\"scheme\":\"https\"}",
"content-type": "application/json", "x-forwarded-for": "", "x-forwarded-host": "some.proxy.runpod.net",
"x-forwarded-proto": "https"}}:async_stream:generate_stream{
request=GenerateRequest { inputs: "[{\"content\":\"You are a helpful assistant. Write your answers using markdown markup.\",\"role\":\"system\"},{\"content\":\"write a quick sort in kotlin\",\"role\":\"user\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a quick sort implementation in Kotlin:\\n```kotlin\\nfun quickSort(array: Array<Int>): Array<Int> {\\n if (array.size <= 1) {\\n return array\\n }\\n var pivot = array[array.size / 2]\\n var left = Array<Int>()\\n var right = Array<Int>()\\n for (i in 0 until array.size) {\\n if (array[i] < pivot) {\\n left += array[i]\\n } else {\\n right += array[i]\\n }\\n }\\n return quickSort(left) + Array(pivot) + quickSort(right)\\n}\\n```\\nThis implementation uses the divide-and-conquer approach to sort an array. It selects a pivot element and partitions the array into two sub-arrays, one containing elements smaller than the pivot and the other containing elements greater than the pivot. It then recursively sorts the sub-arrays and combines them with the pivot element to obtain the sorted array.\\n\"},{\"content\":\"write an SQL query to select all users that were active this year\",\"role\":\"user\"},{\"content\":\"Sure, here's an SQL query that selects all users that were active this year:\\n```sql\\nSELECT * FROM users\\nWHERE active_date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)\\nAND active_date <= CURDATE();\\n```\\nThis query selects all rows from the `users` table where the `active_date` column is greater than or equal to one year ago from the current date and less than or equal to the current date. 
This will return all users that were active this year.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a pocker simulator in kotlin\"},{\"content\":\"Sure, here's a simple pocker simulator in Kotlin:\\n```kotlin\\nfun main(args: Array<String>) {\\n val pocker = Pocker()\\n val pocker.start()\\n val pocker.stop()\\n println(\\\"Pocker stopped.\\\")\\n}\\n```\\nThis implementation creates a new instance of the `Pocker` class and starts the pocker process. It then stops the pocker process and prints a message indicating that the pocker has stopped.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a red black tree algorithm\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a simple implementation of a red-black tree algorithm in Kotlin:\\n```kotlin\\nclass Node<T> {\\n private var key: Int\\n private var value: T\\n private var left: Node<T>?\\n private var right: Node<T>?\\n private var color: String\\n\\n override fun toString(): String {\\n return \\\"$key: $value\\\"\\n }\\n}\\n\\nclass RedBlackTree<T> {\\n private var root: Node<T>?\\n\\n fun insert(key: Int, value: T): Boolean {\\n val node = root\\n while (node!= null) {\\n if (key < node.key) {\\n node = node.left\\n } else if (key > node.key) {\\n node = node.right\\n } else {\\n return false\\n }\\n }\\n val newNode = Node(key, value)\\n if (root == null) {\\n root = newNode\\n } else {\\n newNode.left = root\\n newNode.right = root\\n root = newNode\\n }\\n return true\\n }\\n\\n fun delete(key: Int): Boolean {\\n val node = root\\n while (node!= null) {\\n if (key < node.key) {\\n node = node.left\\n } else if (key > node.key) {\\n node = node.right\\n } else {\\n if (node.left == null && node.right == null) {\\n if (node.color == \\\"red\\\") {\\n node.color = \\\"black\\\"\\n node.left.color = \\\"red\\\"\\n }\\n root = null\\n } else if (node.left == null) {\\n if (node.color == \\\"red\\\") {\\n node.color = \\\"black\\\"\\n node.right.color = \\\"red\\\"\\n 
}\\n node = node.right\\n } else if (node.right == null) {\\n if (node.color == \\\"red\\\") {\\n node.color = \\\"black\\\"\\n node.left.color = \\\"red\\\"\\n }\\n node = node.left\\n } else {\\n val successor = findSuccessor(node.right)\\n val temp = successor.key\\n successor.key = node.key\\n node.key = temp\\n delete(temp)\\n }\\n }\\n }\\n return true\\n }\\n\\n private fun findSuccessor(node: Node<T>): Node<T> {\\n val current = node\\n while (current.left!= null) {\\n current = current.left\\n }\\n return current\\n }\\n}\\n```\\nThis implementation defines a `Node` class to represent each node in the red-black tree, and a `RedBlackTree` class to represent the tree itself. The `insert` method inserts a new node into the tree, while the `delete` method deletes a node from the tree. The `findSuccessor` method finds the successor of a given node in the tree.\\n\"},{\"content\":\"write self balancing tree algorithm\",\"role\":\"user\"}]",
parameters: GenerateParameters {
adapter_id: Some("s3://mybucket/model-1253878534445035520/"),
adapter_source: None,
adapter_parameters: None,
api_token: None,
best_of: None,
temperature: Some(1e-7),
repetition_penalty: None,
top_k: None,
top_p: None,
typical_p: None,
do_sample: false,
max_new_tokens: None,
ignore_eos_token: false,
return_full_text: Some(false),
stop: ["<|im_end|>","<|im_end|>"],
truncate: None,
watermark: false,
details: true,
decoder_input_details: false,
return_k_alternatives: None,
apply_chat_template: true,
seed: None,
response_format: None } }}:infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
}
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Run LoRAX
- Send chat completion requests with long context
- At some point, response streaming hangs
- All subsequent requests fail
Expected behavior
If one request fails, subsequent requests should not fail.
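For context, a device-side assert generally leaves the CUDA context of the process in an unusable state, so later CUDA calls in the same process fail too. Isolating requests from each other then comes down to restarting the server process. Below is a minimal sketch of an external supervisor that does this. The names `is_sticky_cuda_error` and `supervise` are hypothetical and are not part of LoRAX; the command being supervised is whatever launches the server.

```python
import subprocess
import sys
import time

# Substrings that indicate the CUDA context can no longer be used.
# Assumption: matching on the message text seen in the logs of this issue.
STICKY_MARKERS = ("device-side assert triggered",)


def is_sticky_cuda_error(message: str) -> bool:
    """Return True if a log line or exception message looks unrecoverable."""
    return any(marker in message for marker in STICKY_MARKERS)


def supervise(cmd, max_restarts=3, backoff_s=0.0):
    """Run cmd and restart it each time it exits with a non-zero code.

    Returns the list of exit codes, one per attempt.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean shutdown, nothing to restart
        time.sleep(backoff_s)
    return exit_codes


if __name__ == "__main__":
    # Stand-in child process that always fails, for illustration only.
    print(supervise([sys.executable, "-c", "import sys; sys.exit(1)"], max_restarts=1))
```

In a container setup the same effect can usually be had by letting the process exit on such an error and relying on the container restart policy.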
Stacktrace:
2024-06-22T01:35:24.940390508Z 2024-06-22T01:35:24.940129Z ERROR lorax_launcher: interceptor.py:41 Method Prefill encountered an error.
2024-06-22T01:35:24.940438547Z Traceback (most recent call last):
2024-06-22T01:35:24.940442357Z File "/opt/conda/bin/lorax-server", line 8, in <module>
2024-06-22T01:35:24.940444837Z sys.exit(app())
2024-06-22T01:35:24.940447727Z File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
2024-06-22T01:35:24.940450087Z return get_command(self)(*args, **kwargs)
2024-06-22T01:35:24.940452947Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
2024-06-22T01:35:24.940455237Z return self.main(*args, **kwargs)
2024-06-22T01:35:24.940457377Z File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
2024-06-22T01:35:24.940459427Z return _main(
2024-06-22T01:35:24.940461527Z File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
2024-06-22T01:35:24.940463547Z rv = self.invoke(ctx)
2024-06-22T01:35:24.940465637Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
2024-06-22T01:35:24.940467637Z return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-06-22T01:35:24.940469767Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
2024-06-22T01:35:24.940471797Z return ctx.invoke(self.callback, **ctx.params)
2024-06-22T01:35:24.940473867Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
2024-06-22T01:35:24.940475907Z return __callback(*args, **kwargs)
2024-06-22T01:35:24.940477937Z File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
2024-06-22T01:35:24.940479937Z return callback(**use_params) # type: ignore
2024-06-22T01:35:24.940481977Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
2024-06-22T01:35:24.940483977Z server.serve(
2024-06-22T01:35:24.940486097Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 321, in serve
2024-06-22T01:35:24.940488187Z asyncio.run(
2024-06-22T01:35:24.940490297Z File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
2024-06-22T01:35:24.940492517Z return loop.run_until_complete(main)
2024-06-22T01:35:24.940494587Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
2024-06-22T01:35:24.940496737Z self.run_forever()
2024-06-22T01:35:24.940498877Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
2024-06-22T01:35:24.940500997Z self._run_once()
2024-06-22T01:35:24.940503147Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
2024-06-22T01:35:24.940505417Z handle._run()
2024-06-22T01:35:24.940507627Z File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
2024-06-22T01:35:24.940509857Z self._context.run(self._callback, *self._args)
2024-06-22T01:35:24.940518256Z File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
2024-06-22T01:35:24.940521216Z return await self.intercept(
2024-06-22T01:35:24.940523476Z > File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
2024-06-22T01:35:24.940525606Z return await response
2024-06-22T01:35:24.940527986Z File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
2024-06-22T01:35:24.940530426Z raise error
2024-06-22T01:35:24.940532486Z File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
2024-06-22T01:35:24.940534576Z return await behavior(request_or_iterator, context)
2024-06-22T01:35:24.940538416Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 88, in Prefill
2024-06-22T01:35:24.940540536Z batch = self.model.batch_type.from_pb(
2024-06-22T01:35:24.940542666Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 272, in from_pb
2024-06-22T01:35:24.940544706Z adapter_indices = torch.cat(adapter_indices_list).to(dtype=torch.int64, device=device)
2024-06-22T01:35:24.940550316Z RuntimeError: CUDA error: device-side assert triggered
2024-06-22T01:35:24.940552636Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-06-22T01:35:24.940554636Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-06-22T01:35:24.940556746Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
cc @tgaddair
I ran an experiment in which I make inference requests sequentially, each time with a different adapter. It eventually fails every time on this line: https://github.com/predibase/lorax/blob/ecbe9eaf714fdfc1d9db86ce947ed7740b0bb918/server/lorax_server/adapters/lora.py#L169
Restarting the server and retrying the same failing adapter works, which means the issue is not with the adapter. Maybe there is an issue with how LoRAX manages adapters in memory?
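A rough sketch of that kind of sequential test is shown here, so the experiment is easier to repeat. It assumes the OpenAI-compatible endpoint takes the adapter ID in the `model` field and that the server listens on the port from the launch command; the helper names, the adapter IDs and the prompt are placeholders, not the ones used in the original run.

```python
import json
import urllib.request


def build_chat_request(adapter_id, prompt, stream=False):
    """Build a chat completions payload that targets one adapter.

    Assumption: the adapter ID is passed in the `model` field.
    """
    return {
        "model": adapter_id,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }


def send(base_url, payload, timeout_s=60.0):
    """POST the payload and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


def run_sequentially(base_url, adapter_ids, prompt, send_fn=send):
    """Send one request per adapter, in order.

    Returns the first adapter ID whose request fails, or None if all succeed.
    """
    for adapter_id in adapter_ids:
        try:
            send_fn(base_url, build_chat_request(adapter_id, prompt))
        except Exception as exc:  # HTTP errors and timeouts both count as failures
            print(f"request failed for {adapter_id}: {exc}")
            return adapter_id
    return None
```

For example, `run_sequentially("http://localhost:3000", adapter_ids, "write a quick sort in kotlin")` would report the first adapter for which the server returns an error.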
ERROR lorax_launcher: interceptor.py:41 Method LoadAdapter encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 321, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 203, in LoadAdapter
self.model.load_adapter(adapter_parameters, adapter_source, adapter_index, api_token)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/model.py", line 184, in load_adapter
adapter_weights = adapter_config.load_batched_adapter_weights(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/adapters/lora.py", line 54, in load_batched_adapter_weights
return LoraWeights.load(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/adapters/lora.py", line 128, in load
lora_a = lora_a.to(base_device, model.dtype)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR lorax_client: router/client/src/lib.rs:34: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
INFO lorax_router::loader: router/src/loader.rs:207: FAILED loading adapter s3:s3://mybucket/model-1255218038083960832/
INFO lorax_router::queue: router/src/queue.rs:139: set adapter s3:s3://mybucket/model-1255218038083960832/ status to Errored
INFO lorax_router::loader: router/src/loader.rs:277: terminating adapter s3:s3://mybucket/model-1255218038083960832/ loader
thread 'tokio-runtime-worker' panicked at router/src/loader.rs:291:30:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ERROR lorax_launcher: Webserver Crashed
Additionally, the webserver crashes as well.
Resolved after updating the Docker image to latest.
No, actually the issue is not resolved. The same test, run long enough, eventually crashes, now in a different code path:
2024-07-03T16:46:05.501747976Z 2024-07-03T16:46:05.501171Z INFO lorax_router::loader: router/src/loader.rs:198: adapter s3:s3://mybucket/model-1257405854475001856/ loaded
2024-07-03T16:46:05.501803646Z 2024-07-03T16:46:05.501204Z INFO lorax_router::queue: router/src/queue.rs:139: set adapter s3:s3://mybucket/model-1257405854475001856/ status to Ready
2024-07-03T16:47:00.129769832Z 2024-07-03T16:47:00.129526Z ERROR lorax_launcher: interceptor.py:41 Method Decode encountered an error.
2024-07-03T16:47:00.129808043Z Traceback (most recent call last):
2024-07-03T16:47:00.129812703Z File "/opt/conda/bin/lorax-server", line 8, in <module>
2024-07-03T16:47:00.129816803Z sys.exit(app())
2024-07-03T16:47:00.129820473Z File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
2024-07-03T16:47:00.129824183Z return get_command(self)(*args, **kwargs)
2024-07-03T16:47:00.129829023Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
2024-07-03T16:47:00.129831843Z return self.main(*args, **kwargs)
2024-07-03T16:47:00.129834823Z File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
2024-07-03T16:47:00.129837603Z return _main(
2024-07-03T16:47:00.129840603Z File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
2024-07-03T16:47:00.129843403Z rv = self.invoke(ctx)
2024-07-03T16:47:00.129846903Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
2024-07-03T16:47:00.129849683Z return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-07-03T16:47:00.129852303Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
2024-07-03T16:47:00.129855043Z return ctx.invoke(self.callback, **ctx.params)
2024-07-03T16:47:00.129857563Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
2024-07-03T16:47:00.129860273Z return __callback(*args, **kwargs)
2024-07-03T16:47:00.129863013Z File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
2024-07-03T16:47:00.129865813Z return callback(**use_params) # type: ignore
2024-07-03T16:47:00.129868353Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 83, in serve
2024-07-03T16:47:00.129871073Z server.serve(
2024-07-03T16:47:00.129873883Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 309, in serve
2024-07-03T16:47:00.129876754Z asyncio.run(
2024-07-03T16:47:00.129879603Z File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
2024-07-03T16:47:00.129882854Z return loop.run_until_complete(main)
2024-07-03T16:47:00.129885643Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
2024-07-03T16:47:00.129888414Z self.run_forever()
2024-07-03T16:47:00.129891103Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
2024-07-03T16:47:00.129893734Z self._run_once()
2024-07-03T16:47:00.129896594Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
2024-07-03T16:47:00.129899684Z handle._run()
2024-07-03T16:47:00.129902284Z File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
2024-07-03T16:47:00.129905054Z self._context.run(self._callback, *self._args)
2024-07-03T16:47:00.129908914Z File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
2024-07-03T16:47:00.129911794Z return await self.intercept(
2024-07-03T16:47:00.129914464Z > File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
2024-07-03T16:47:00.129917914Z return await response
2024-07-03T16:47:00.129920724Z File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
2024-07-03T16:47:00.129929134Z raise error
2024-07-03T16:47:00.129932004Z File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
2024-07-03T16:47:00.129934594Z return await behavior(request_or_iterator, context)
2024-07-03T16:47:00.129937244Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 117, in Decode
2024-07-03T16:47:00.129940024Z generations, next_batch = self.model.generate_token(batch)
2024-07-03T16:47:00.129942784Z File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
2024-07-03T16:47:00.129945294Z return func(*args, **kwds)
2024-07-03T16:47:00.129947824Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 957, in generate_token
2024-07-03T16:47:00.129950564Z out, speculative_logits = self._try_generate_token(batch, adapter_data)
2024-07-03T16:47:00.129953314Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 916, in _try_generate_token
2024-07-03T16:47:00.129956104Z raise e
2024-07-03T16:47:00.129959154Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 913, in _try_generate_token
2024-07-03T16:47:00.129961814Z return self.forward(batch, adapter_data)
2024-07-03T16:47:00.129964384Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 892, in forward
2024-07-03T16:47:00.129967324Z logits = model.forward(
2024-07-03T16:47:00.129970124Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_phi_modeling.py", line 390, in forward
2024-07-03T16:47:00.129973114Z hidden_states = self.model(
2024-07-03T16:47:00.129975854Z File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-07-03T16:47:00.129978835Z return self._call_impl(*args, **kwargs)
2024-07-03T16:47:00.129981695Z File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-07-03T16:47:00.129984635Z return forward_call(*args, **kwargs)
2024-07-03T16:47:00.129987835Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_phi_modeling.py", line 338, in forward
2024-07-03T16:47:00.129990484Z cos, sin = self.layers[0].self_attn.rotary_emb.get_cos_sin(position_ids, max_s, hidden_states.dtype)
2024-07-03T16:47:00.129993475Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 971, in get_cos_sin
2024-07-03T16:47:00.129996185Z cos = torch.index_select(self._cos_cached, 0, position_ids)
2024-07-03T16:47:00.129998715Z RuntimeError: CUDA error: device-side assert triggered
2024-07-03T16:47:00.130001325Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-03T16:47:00.130003935Z
2024-07-03T16:47:00.130006455Z
2024-07-03T16:47:00.130691082Z 2024-07-03T16:47:00.130570Z ERROR batch{batch_size=1}:decode:decode{size=1}:decode{size=1}: lorax_client: router/client/src/lib.rs:34: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
2024-07-03T16:47:00.130703242Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-03T16:47:00.130707472Z
2024-07-03T16:47:00.133111585Z 2024-07-03T16:47:00.133061Z ERROR lorax_launcher: interceptor.py:41 Method ClearCache encountered an error.
2024-07-03T16:47:00.133118915Z Traceback (most recent call last):
2024-07-03T16:47:00.133122075Z File "/opt/conda/bin/lorax-server", line 8, in <module>
2024-07-03T16:47:00.133124606Z sys.exit(app())
2024-07-03T16:47:00.133127146Z File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
2024-07-03T16:47:00.133129386Z return get_command(self)(*args, **kwargs)
2024-07-03T16:47:00.133135726Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
2024-07-03T16:47:00.133138106Z return self.main(*args, **kwargs)
2024-07-03T16:47:00.133140256Z File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
2024-07-03T16:47:00.133142846Z return _main(
2024-07-03T16:47:00.133145276Z File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
2024-07-03T16:47:00.133147636Z rv = self.invoke(ctx)
2024-07-03T16:47:00.133149956Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
2024-07-03T16:47:00.133152286Z return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-07-03T16:47:00.133155536Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
2024-07-03T16:47:00.133157896Z return ctx.invoke(self.callback, **ctx.params)
2024-07-03T16:47:00.133160016Z File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
2024-07-03T16:47:00.133162186Z return __callback(*args, **kwargs)
2024-07-03T16:47:00.133164556Z File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
2024-07-03T16:47:00.133166866Z return callback(**use_params) # type: ignore
2024-07-03T16:47:00.133168956Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 83, in serve
2024-07-03T16:47:00.133171286Z server.serve(
2024-07-03T16:47:00.133173576Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 309, in serve
2024-07-03T16:47:00.133175976Z asyncio.run(
2024-07-03T16:47:00.133178166Z File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
2024-07-03T16:47:00.133180436Z return loop.run_until_complete(main)
2024-07-03T16:47:00.133183256Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
2024-07-03T16:47:00.133185636Z self.run_forever()
2024-07-03T16:47:00.133188106Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
2024-07-03T16:47:00.133190426Z self._run_once()
2024-07-03T16:47:00.133192616Z File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
2024-07-03T16:47:00.133194716Z handle._run()
2024-07-03T16:47:00.133196846Z File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
2024-07-03T16:47:00.133199036Z self._context.run(self._callback, *self._args)
2024-07-03T16:47:00.133201726Z File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
2024-07-03T16:47:00.133204276Z return await self.intercept(
2024-07-03T16:47:00.133206536Z > File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
2024-07-03T16:47:00.133208916Z return await response
2024-07-03T16:47:00.133211286Z File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
2024-07-03T16:47:00.133215396Z raise error
2024-07-03T16:47:00.133217586Z File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
2024-07-03T16:47:00.133220006Z return await behavior(request_or_iterator, context)
2024-07-03T16:47:00.133222447Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 55, in ClearCache
2024-07-03T16:47:00.133224636Z self.cache.delete(request.id)
2024-07-03T16:47:00.133226787Z File "/opt/conda/lib/python3.10/site-packages/lorax_server/cache.py", line 40, in delete
2024-07-03T16:47:00.133229116Z torch.cuda.empty_cache()
2024-07-03T16:47:00.133231247Z File "/opt/conda/lib/python3.10/site-packages/torch/cuda/memory.py", line 162, in empty_cache
2024-07-03T16:47:00.133233347Z torch._C._cuda_emptyCache()
2024-07-03T16:47:00.133235676Z RuntimeError: CUDA error: device-side assert triggered
2024-07-03T16:47:00.133237867Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-03T16:47:00.133240017Z
2024-07-03T16:47:00.133242087Z
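The Decode trace above ends in an index lookup (`torch.index_select` on the cached cos table). An out-of-range index is one common cause of a device-side assert in such a lookup, although nothing in these logs confirms that this is what happened here, and the earlier trace notes that the reported line may be wrong because CUDA errors are reported asynchronously. As a debugging aid only, a host-side range check like the sketch below turns an out-of-range index into an ordinary Python exception before the kernel runs. The function name `check_indices` is hypothetical.

```python
def check_indices(indices, table_len, name="index"):
    """Raise IndexError if any value falls outside [0, table_len).

    `indices` is any iterable of Python ints. With a tensor, one would pass
    something like `position_ids.tolist()`, which forces a device sync and
    is therefore only suitable while debugging.
    """
    bad = [i for i in indices if not 0 <= i < table_len]
    if bad:
        raise IndexError(
            f"{name}: {len(bad)} value(s) outside [0, {table_len}), e.g. {bad[0]}"
        )


if __name__ == "__main__":
    # In range for a table of 2048 rows: passes silently.
    check_indices([0, 1, 2047], 2048, name="position_ids")
    # Out of range: raises IndexError instead of a device-side assert.
    try:
        check_indices([2048], 2048, name="position_ids")
    except IndexError as exc:
        print(exc)
```

Placing such a check just before the failing lookup, together with `CUDA_LAUNCH_BLOCKING=1`, would show whether the indices or some earlier kernel is at fault.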
@magdyksaleh, can you have a look? I was able to reproduce it on Predibase cloud as well.
Hey @yunmanger1, I'll try and repro this today. In the meantime, if there's any additional info you can provide to help with the repro, please let me know. For example:
- Exact inputs (I see at least one request in the first post has inputs, but don't see any inputs for other requests)
- Adapter details (rank, target modules)