Fails hard on CUDA error
System Info
We are using the streaming v1 chat completions API. After some number of requests, or after a single request with a large enough context, the lorax server stops responding, and every subsequent request fails with:
```
infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
```
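For context, this is roughly the kind of request we send. The sketch below is illustrative only (our real client is a Kotlin/Ktor app, as visible in the request headers in the log further down); the port, model id, and the /v1/chat/completions route are taken from the launcher command and logs in this report, while the message content and the OpenAI-style SSE parsing are assumptions.

```python
# Minimal sketch of the kind of streaming request we send (illustrative only;
# our production client is Kotlin/Ktor). Port and model id come from the
# launcher args below; the SSE handling assumes OpenAI-style chunks.
import json

import requests

BASE_URL = "http://localhost:3000"  # --port 3000 in the launcher args

payload = {
    "model": "microsoft/phi-2",
    "stream": True,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "write a quick sort in kotlin"},
    ],
}

with requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assuming the OpenAI-style stream terminator
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        print(delta.get("content") or "", end="", flush=True)
```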
We are running it in Docker with a single GPU (A100 80GB PCIe) on runpod.io:
```
lorax-launcher --model-id microsoft/phi-2 --adapter-source s3 --compile --dtype bfloat16 --port 3000 --revision ef382358ec9e382308935a992d908de099b64c23 --max-input-length 2000 --max-total-tokens 2048 --env
2024-06-22T01:38:49.630259Z INFO lorax_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.74.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Sat Jun 22 01:38:49 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:E1:00.0 Off |                    0 |
| N/A   34C    P0              61W / 300W |  71234MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
2024-06-22T01:38:49.630346Z INFO lorax_launcher: Args { model_id: "microsoft/phi-2", adapter_id: None, source: "hub", adapter_source: "s3", revision: Some("ef382358ec9e382308935a992d908de099b64c23"), validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: true, dtype: Some(BFloat16), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 2000, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "960a5e26c0d7", port: 3000, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true, download_only: false }
```

Full request log:

```
2024-06-22T01:12:03.526879786Z 2024-06-22T01:12:03.526727Z ERROR HTTP request{
otel.name=POST
/v1/chat/completions
http.flavor=1.1
http.method=POST
http.route=/v1/chat/completions
http.scheme=HTTP
http.target=/v1/chat/completions
http.user_agent=Ktor client
otel.kind=server
trace_id=e68f52322fc88977fb39f91db1970199
http.status_code=200 otel.status_code="OK"
}:chat_completions_v1{default_return_full_text=Extension(false) info=Extension(Info {
model_id: "microsoft/phi-2",
model_sha: Some("ef382358ec9e382308935a992d908de099b64c23"),
model_dtype: "torch.bfloat16",
model_device_type: "cuda",
model_pipeline_tag: Some("text-generation"),
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_input_length: 2000,
max_total_tokens: 2048,
waiting_served_ratio: 1.2,
max_batch_total_tokens: 188144,
max_waiting_tokens: 20,
validation_workers: 2,
version: "0.1.0",
sha: None,
docker_label: None,
request_logger_url: None }
) request_logger_sender=Extension(Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x55bd96f750e0,
tail_position: 0 },
semaphore: Semaphore { semaphore: Semaphore { permits: 32 },
bound: 32 }, rx_waker: AtomicWaker, tx_count: 1, rx_fields: "..." } } })
req_headers={
"user-agent": "Ktor client",
"content-length": "5341",
"accept": "text/event-stream,application/json",
"accept-charset": "UTF-8",
"accept-encoding": "gzip, br",
"authorization": "Bearer",
"cache-control": "no-cache",
"cdn-loop": "cloudflare",
"cf-ipcountry": "US",
"cf-ray": "89785e2c4ce3ce40-SJC",
"cf-visitor": "{\"scheme\":\"https\"}",
"content-type": "application/json", "x-forwarded-for": "", "x-forwarded-host": "some.proxy.runpod.net",
"x-forwarded-proto": "https"}}:async_stream:generate_stream{
request=GenerateRequest { inputs: "[{\"content\":\"You are a helpful assistant. Write your answers using markdown markup.\",\"role\":\"system\"},{\"content\":\"write a quick sort in kotlin\",\"role\":\"user\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a quick sort implementation in Kotlin:\\n```kotlin\\nfun quickSort(array: Array<Int>): Array<Int> {\\n if (array.size <= 1) {\\n return array\\n }\\n var pivot = array[array.size / 2]\\n var left = Array<Int>()\\n var right = Array<Int>()\\n for (i in 0 until array.size) {\\n if (array[i] < pivot) {\\n left += array[i]\\n } else {\\n right += array[i]\\n }\\n }\\n return quickSort(left) + Array(pivot) + quickSort(right)\\n}\\n```\\nThis implementation uses the divide-and-conquer approach to sort an array. It selects a pivot element and partitions the array into two sub-arrays, one containing elements smaller than the pivot and the other containing elements greater than the pivot. It then recursively sorts the sub-arrays and combines them with the pivot element to obtain the sorted array.\\n\"},{\"content\":\"write an SQL query to select all users that were active this year\",\"role\":\"user\"},{\"content\":\"Sure, here's an SQL query that selects all users that were active this year:\\n```sql\\nSELECT * FROM users\\nWHERE active_date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)\\nAND active_date <= CURDATE();\\n```\\nThis query selects all rows from the `users` table where the `active_date` column is greater than or equal to one year ago from the current date and less than or equal to the current date. This will return all users that were active this year.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a pocker simulator in kotlin\"},{\"content\":\"Sure, here's a simple pocker simulator in Kotlin:\\n```kotlin\\nfun main(args: Array<String>) {\\n val pocker = Pocker()\\n val pocker.start()\\n val pocker.stop()\\n println(\\\"Pocker stopped.\\\")\\n}\\n```\\nThis implementation creates a new instance of the `Pocker` class and starts the pocker process. 
It then stops the pocker process and prints a message indicating that the pocker has stopped.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a red black tree algorithm\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a simple implementation of a red-black tree algorithm in Kotlin:\\n```kotlin\\nclass Node<T> {\\n private var key: Int\\n private var value: T\\n private var left: Node<T>?\\n private var right: Node<T>?\\n private var color: String\\n\\n override fun toString(): String {\\n return \\\"$key: $value\\\"\\n }\\n}\\n\\nclass RedBlackTree<T> {\\n private var root: Node<T>?\\n\\n fun insert(key: Int, value: T): Boolean {\\n val node = root\\n while (node!= null) {\\n if (key < node.key) {\\n node = node.left\\n } else if (key > node.key) {\\n node = node.right\\n } else {\\n return false\\n }\\n }\\n val newNode = Node(key, value)\\n if (root == null) {\\n root = newNode\\n } else {\\n newNode.left = root\\n newNode.right = root\\n root = newNode\\n }\\n return true\\n }\\n\\n fun delete(key: Int): Boolean {\\n val node = root\\n while (node!= null) {\\n if (key < node.key) {\\n node = node.left\\n } else if (key > node.key) {\\n node = node.right\\n } else {\\n if (node.left == null && node.right == null) {\\n if (node.color == \\\"red\\\") {\\n node.color = \\\"black\\\"\\n node.left.color = \\\"red\\\"\\n }\\n root = null\\n } else if (node.left == null) {\\n if (node.color == \\\"red\\\") {\\n node.color = \\\"black\\\"\\n node.right.color = \\\"red\\\"\\n }\\n node = node.right\\n } else if (node.right == null) {\\n if (node.color == \\\"red\\\") {\\n node.color = \\\"black\\\"\\n node.left.color = \\\"red\\\"\\n }\\n node = node.left\\n } else {\\n val successor = findSuccessor(node.right)\\n val temp = successor.key\\n successor.key = node.key\\n node.key = temp\\n delete(temp)\\n }\\n }\\n }\\n return true\\n }\\n\\n private fun findSuccessor(node: Node<T>): Node<T> {\\n val current = node\\n while (current.left!= null) {\\n current = current.left\\n }\\n return current\\n }\\n}\\n```\\nThis implementation defines a `Node` class to represent each node in the red-black tree, and a `RedBlackTree` class to represent the tree itself. The `insert` method inserts a new node into the tree, while the `delete` method deletes a node from the tree. The `findSuccessor` method finds the successor of a given node in the tree.\\n\"},{\"content\":\"write self balancing tree algorithm\",\"role\":\"user\"}]",
parameters: GenerateParameters {
adapter_id: Some("s3://mybucket/model-1253878534445035520/"),
adapter_source: None,
adapter_parameters: None,
api_token: None,
best_of: None,
temperature: Some(1e-7),
repetition_penalty: None,
top_k: None,
top_p: None,
typical_p: None,
do_sample: false,
max_new_tokens: None,
ignore_eos_token: false,
return_full_text: Some(false),
stop: ["<|im_end|>","<|im_end|>"],
truncate: None,
watermark: false,
details: true,
decoder_input_details: false,
return_k_alternatives: None,
apply_chat_template: true,
seed: None,
response_format: None } }}:infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
}
```
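One possibly relevant detail, purely arithmetic over the Args values above (not a claim about where the assert originates): with --max-input-length 2000 and --max-total-tokens 2048, a prompt that uses the full input budget leaves room for only 48 generated tokens.

```python
# Token budget implied by the launcher Args above (values copied from the log).
max_input_length = 2000
max_total_tokens = 2048
print(max_total_tokens - max_input_length)  # 48 tokens left for generation at full context
```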
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Run lorax
- Send chat completion requests with a long context (see the sketch below)
- At some point, response streaming hangs
- All subsequent requests fail
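A rough driver along these lines is enough to hit the failure for us. It is a hypothetical sketch, not our production code: it assumes the same endpoint and launcher limits as above, and it uses plain (non-streaming) requests for the follow-ups, which fail just as consistently once the CUDA assert has fired.

```python
# Hypothetical reproduction driver, assuming an OpenAI-compatible endpoint on
# localhost:3000 as configured by the launcher args in this report.
import requests

BASE_URL = "http://localhost:3000"


def chat(content: str, timeout: float = 120.0) -> int:
    """Send one chat completion request and return the HTTP status code."""
    payload = {
        "model": "microsoft/phi-2",
        "messages": [{"role": "user", "content": content}],
    }
    resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=timeout)
    return resp.status_code


# 1. One request with a prompt close to the --max-input-length 2000 budget.
long_prompt = "write a self balancing tree algorithm " * 250
try:
    print("long-context request ->", chat(long_prompt))
except requests.exceptions.RequestException as exc:
    print("long-context request failed:", exc)  # streaming hangs / errors here for us

# 2. Once one request has hit the device-side assert, every follow-up request
#    keeps failing until the container is restarted.
for i in range(3):
    try:
        print(f"follow-up {i} ->", chat("hello"))
    except requests.exceptions.RequestException as exc:
        print(f"follow-up {i} failed:", exc)
```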
Expected behavior
If one request fails, subsequent requests should not also fail.