Fails hard on CUDA error
System Info
We are using the streaming v1 chat completions API. After some number of requests, or after a single request with a large enough context, the lorax server stops responding, and every subsequent request fails with:
```
infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
```
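For context, this is roughly the kind of request we send. The sketch below is illustrative only (our real client is a Kotlin/Ktor app, as visible in the request headers in the log further down); the port, model id, and the /v1/chat/completions route are taken from the launcher command and logs in this report, while the message content and the OpenAI-style SSE parsing are assumptions.

```python
# Minimal sketch of the kind of streaming request we send (illustrative only;
# our production client is Kotlin/Ktor). Port and model id come from the
# launcher args below; the SSE handling assumes OpenAI-style chunks.
import json

import requests

BASE_URL = "http://localhost:3000"  # --port 3000 in the launcher args

payload = {
    "model": "microsoft/phi-2",
    "stream": True,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "write a quick sort in kotlin"},
    ],
}

with requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assuming the OpenAI-style stream terminator
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        print(delta.get("content") or "", end="", flush=True)
```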
We are running it in Docker with a single GPU (A100 80GB PCIe) on runpod.io:
```
lorax-launcher --model-id microsoft/phi-2 --adapter-source s3 --compile --dtype bfloat16 --port 3000 --revision ef382358ec9e382308935a992d908de099b64c23 --max-input-length 2000 --max-total-tokens 2048 --env
2024-06-22T01:38:49.630259Z INFO lorax_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.74.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Sat Jun 22 01:38:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:E1:00.0 Off |                    0 |
| N/A   34C    P0              61W / 300W |  71234MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
2024-06-22T01:38:49.630346Z INFO lorax_launcher: Args { model_id: "microsoft/phi-2", adapter_id: None, source: "hub", adapter_source: "s3", revision: Some("ef382358ec9e382308935a992d908de099b64c23"), validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: true, dtype: Some(BFloat16), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 2000, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "960a5e26c0d7", port: 3000, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true, download_only: false }
```

Full request log:

```
2024-06-22T01:12:03.526879786Z 2024-06-22T01:12:03.526727Z ERROR HTTP request{
otel.name=POST
/v1/chat/completions
http.flavor=1.1
http.method=POST
http.route=/v1/chat/completions
http.scheme=HTTP
http.target=/v1/chat/completions
http.user_agent=Ktor client
otel.kind=server
trace_id=e68f52322fc88977fb39f91db1970199
http.status_code=200 otel.status_code="OK"
}:chat_completions_v1{default_return_full_text=Extension(false) info=Extension(Info {
model_id: "microsoft/phi-2",
model_sha: Some("ef382358ec9e382308935a992d908de099b64c23"),
model_dtype: "torch.bfloat16",
model_device_type: "cuda",
model_pipeline_tag: Some("text-generation"),
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_input_length: 2000,
max_total_tokens: 2048,
waiting_served_ratio: 1.2,
max_batch_total_tokens: 188144,
max_waiting_tokens: 20,
validation_workers: 2,
version: "0.1.0",
sha: None,
docker_label: None,
request_logger_url: None }
) request_logger_sender=Extension(Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x55bd96f750e0,
tail_position: 0 },
semaphore: Semaphore { semaphore: Semaphore { permits: 32 },
bound: 32 }, rx_waker: AtomicWaker, tx_count: 1, rx_fields: "..." } } })
req_headers={
"user-agent": "Ktor client",
"content-length": "5341",
"accept": "text/event-stream,application/json",
"accept-charset": "UTF-8",
"accept-encoding": "gzip, br",
"authorization": "Bearer",
"cache-control": "no-cache",
"cdn-loop": "cloudflare",
"cf-ipcountry": "US",
"cf-ray": "89785e2c4ce3ce40-SJC",
"cf-visitor": "{\"scheme\":\"https\"}",
"content-type": "application/json", "x-forwarded-for": "", "x-forwarded-host": "some.proxy.runpod.net",
"x-forwarded-proto": "https"}}:async_stream:generate_stream{
request=GenerateRequest { inputs: "[{\"content\":\"You are a helpful assistant. Write your answers using markdown markup.\",\"role\":\"system\"},{\"content\":\"write a quick sort in kotlin\",\"role\":\"user\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a quick sort implementation in Kotlin:\\n```kotlin\\nfun quickSort(array: Array<Int>): Array<Int> {\\n if (array.size <= 1) {\\n return array\\n }\\n var pivot = array[array.size / 2]\\n var left = Array<Int>()\\n var right = Array<Int>()\\n for (i in 0 until array.size) {\\n if (array[i] < pivot) {\\n left += array[i]\\n } else {\\n right += array[i]\\n }\\n }\\n return quickSort(left) + Array(pivot) + quickSort(right)\\n}\\n```\\nThis implementation uses the divide-and-conquer approach to sort an array. It selects a pivot element and partitions the array into two sub-arrays, one containing elements smaller than the pivot and the other containing elements greater than the pivot. It then recursively sorts the sub-arrays and combines them with the pivot element to obtain the sorted array.\\n\"},{\"content\":\"write an SQL query to select all users that were active this year\",\"role\":\"user\"},{\"content\":\"Sure, here's an SQL query that selects all users that were active this year:\\n```sql\\nSELECT * FROM users\\nWHERE active_date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)\\nAND active_date <= CURDATE();\\n```\\nThis query selects all rows from the `users` table where the `active_date` column is greater than or equal to one year ago from the current date and less than or equal to the current date. This will return all users that were active this year.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a pocker simulator in kotlin\"},{\"content\":\"Sure, here's a simple pocker simulator in Kotlin:\\n```kotlin\\nfun main(args: Array<String>) {\\n val pocker = Pocker()\\n val pocker.start()\\n val pocker.stop()\\n println(\\\"Pocker stopped.\\\")\\n}\\n```\\nThis implementation creates a new instance of the `Pocker` class and starts the pocker process. 
It then stops the pocker process and prints a message indicating that the pocker has stopped.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a red black tree algorithm\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a simple implementation of a red-black tree algorithm in Kotlin:\\n```kotlin\\nclass Node<T> {\\n private var key: Int\\n private var value: T\\n private var left: Node<T>?\\n private var right: Node<T>?\\n private var color: String\\n\\n override fun toString(): String {\\n return \\\"$key: $value\\\"\\n }\\n}\\n\\nclass RedBlackTree<T> {\\n private var root: Node<T>?\\n\\n fun insert(key: Int, value: T): Boolean {\\n val node = root\\n while (node!= null) {\\n if (key < node.key) {\\n node = node.left\\n } else if (key > node.key) {\\n node = node.right\\n } else {\\n return false\\n }\\n }\\n val newNode = Node(key, value)\\n if (root == null) {\\n root = newNode\\n } else {\\n newNode.left = root\\n newNode.right = root\\n root = newNode\\n }\\n return true\\n }\\n\\n fun delete(key: Int): Boolean {\\n val node = root\\n while (node!= null) {\\n if (key < node.key) {\\n node = node.left\\n } else if (key > node.key) {\\n node = node.right\\n } else {\\n if (node.left == null && node.right == null) {\\n if (node.color == \\\"red\\\") {\\n node.color = \\\"black\\\"\\n node.left.color = \\\"red\\\"\\n }\\n root = null\\n } else if (node.left == null) {\\n if (node.color == \\\"red\\\") {\\n node.color = \\\"black\\\"\\n node.right.color = \\\"red\\\"\\n }\\n node = node.right\\n } else if (node.right == null) {\\n if (node.color == \\\"red\\\") {\\n node.color = \\\"black\\\"\\n node.left.color = \\\"red\\\"\\n }\\n node = node.left\\n } else {\\n val successor = findSuccessor(node.right)\\n val temp = successor.key\\n successor.key = node.key\\n node.key = temp\\n delete(temp)\\n }\\n }\\n }\\n return true\\n }\\n\\n private fun findSuccessor(node: Node<T>): Node<T> {\\n val current = node\\n while (current.left!= null) {\\n current = current.left\\n }\\n return current\\n }\\n}\\n```\\nThis implementation defines a `Node` class to represent each node in the red-black tree, and a `RedBlackTree` class to represent the tree itself. The `insert` method inserts a new node into the tree, while the `delete` method deletes a node from the tree. The `findSuccessor` method finds the successor of a given node in the tree.\\n\"},{\"content\":\"write self balancing tree algorithm\",\"role\":\"user\"}]",
parameters: GenerateParameters {
adapter_id: Some("s3://mybucket/model-1253878534445035520/"),
adapter_source: None,
adapter_parameters: None,
api_token: None,
best_of: None,
temperature: Some(1e-7),
repetition_penalty: None,
top_k: None,
top_p: None,
typical_p: None,
do_sample: false,
max_new_tokens: None,
ignore_eos_token: false,
return_full_text: Some(false),
stop: ["<|im_end|>","<|im_end|>"],
truncate: None,
watermark: false,
details: true,
decoder_input_details: false,
return_k_alternatives: None,
apply_chat_template: true,
seed: None,
response_format: None } }}:infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
}
```
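One possibly relevant detail, purely arithmetic over the Args values above (not a claim about where the assert originates): with --max-input-length 2000 and --max-total-tokens 2048, a prompt that uses the full input budget leaves room for only 48 generated tokens.

```python
# Token budget implied by the launcher Args above (values copied from the log).
max_input_length = 2000
max_total_tokens = 2048
print(max_total_tokens - max_input_length)  # 48 tokens left for generation at full context
```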
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Run lorax
- Send chat completion requests with a long context (see the sketch below)
- At some point, response streaming hangs
- All subsequent requests fail
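A rough driver along these lines is enough to hit the failure for us. It is a hypothetical sketch, not our production code: it assumes the same endpoint and launcher limits as above, and it uses plain (non-streaming) requests for the follow-ups, which fail just as consistently once the CUDA assert has fired.

```python
# Hypothetical reproduction driver, assuming an OpenAI-compatible endpoint on
# localhost:3000 as configured by the launcher args in this report.
import requests

BASE_URL = "http://localhost:3000"


def chat(content: str, timeout: float = 120.0) -> int:
    """Send one chat completion request and return the HTTP status code."""
    payload = {
        "model": "microsoft/phi-2",
        "messages": [{"role": "user", "content": content}],
    }
    resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=timeout)
    return resp.status_code


# 1. One request with a prompt close to the --max-input-length 2000 budget.
long_prompt = "write a self balancing tree algorithm " * 250
try:
    print("long-context request ->", chat(long_prompt))
except requests.exceptions.RequestException as exc:
    print("long-context request failed:", exc)  # streaming hangs / errors here for us

# 2. Once one request has hit the device-side assert, every follow-up request
#    keeps failing until the container is restarted.
for i in range(3):
    try:
        print(f"follow-up {i} ->", chat("hello"))
    except requests.exceptions.RequestException as exc:
        print(f"follow-up {i} failed:", exc)
```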
Expected behavior
If one request fails, subsequent requests should not also fail.