Second GPU is not found when running --sharded true
System Info
Lorax version: 0.4.1
Lorax launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUs: RTX 3090 (24 GB), RTX 3060 (12 GB)
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
model=mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data
sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize bitsandbytes-nf4 --max-batch-prefill-tokens 2048 --sharded true
Error Message:
2023-12-24T07:02:10.759386Z INFO lorax_launcher: Parsing num_shard from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
Error: NotEnoughCUDADevices("sharded is true but only found 1 CUDA devices")
Expected behavior
The expected behavior is for LoRAX to find both GPUs. For reference, here is the output of nvidia-smi:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   49C    P8              15W / 170W |      9MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:06:00.0 Off |                  N/A |
|  0%   51C    P8              18W / 350W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
```
I checked the documentation and it says that --sharded true is the default setting of the server; however, when I do not pass --sharded true, I get an out-of-memory error and need to use a much smaller --max-batch-prefill-tokens (1024, to be exact). When I run nvidia-smi, I get the following output:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   44C    P8              15W / 170W |     12MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:06:00.0 Off |                  N/A |
| 81%   57C    P2             114W / 350W |  23873MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      7439      C   /opt/conda/bin/python3.10                 23856MiB |
+---------------------------------------------------------------------------------------+
```
It appears as if the server cannot find the 3060. I swapped the 3060 for one of my other GPUs (a Tesla P100, 16 GB), yet I still received the same error.
Hey @psych0v0yager, apologies for the late reply, I've been out on holiday.
My first suspicion is that PyTorch isn't able to discover the device for some reason. Can you try running the following and sharing the output:
python -c "import torch; print(torch.cuda.device_count())"
No worries! I ran the command in my conda environment and received the following output.
python -c "import torch; print(torch.cuda.device_count())"
2
Was that command run from within the lorax Docker container or outside of it? If you ran it outside the container, it would be worth testing it from within the container (by running docker exec -it <container_id> /bin/bash to open a shell inside it) to see if it gives different results.
Another thing you can try is setting --num_shard 2 explicitly. If it's unable to find the second GPU with that arg, it should hopefully raise a more useful error.
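If it helps narrow things down, something like the following should also print each device's name and memory; this is just standard PyTorch calls, nothing LoRAX-specific, so it should behave the same inside and outside the container:

```python
# List every CUDA device PyTorch can see, along with its name and total memory.
import torch

print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i}  {props.name}  {props.total_memory / 1024**3:.1f} GiB")
```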
Okay, here are my results.
Running the command inside the Docker container gave the same result as outside: it detected 2 devices.
Furthermore, --num-shard worked as well; the model was sharded across the two GPUs. Here is the exact command I ran:
model=TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
volume=$PWD/data
sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize awq --max-batch-prefill-tokens 512 --max-input-length 512 --num-shard 2
However, the container errored out with the following message:
2023-12-31T07:11:04.818168Z INFO lorax_launcher: Args { model_id: "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: Some(Awq), dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 512, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 512, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "7875655a60f8", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2023-12-31T07:11:04.818186Z WARN lorax_launcher: trust_remote_code is set. Trusting that model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ do not contain malicious code.
2023-12-31T07:11:04.818189Z INFO lorax_launcher: Sharding model on 2 processes
2023-12-31T07:11:04.818250Z INFO download: lorax_launcher: Starting download process.
2023-12-31T07:11:07.026335Z INFO lorax_launcher: cli.py:103 Files are already present on the host. Skipping download.
2023-12-31T07:11:07.320835Z INFO download: lorax_launcher: Successfully downloaded weights.
2023-12-31T07:11:07.320990Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2023-12-31T07:11:07.321025Z INFO shard-manager: lorax_launcher: Starting shard rank=1
2023-12-31T07:11:17.329108Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=1
2023-12-31T07:11:17.329108Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2023-12-31T07:11:17.607159Z INFO lorax_launcher: server.py:269 Server started at unix:///tmp/lorax-server-0
2023-12-31T07:11:17.629335Z INFO shard-manager: lorax_launcher: Shard ready in 10.307929251s rank=0
2023-12-31T07:11:17.706473Z INFO lorax_launcher: server.py:269 Server started at unix:///tmp/lorax-server-1
2023-12-31T07:11:17.729408Z INFO shard-manager: lorax_launcher: Shard ready in 10.407867503s rank=1
2023-12-31T07:11:17.828920Z INFO lorax_launcher: Starting Webserver
2023-12-31T07:11:18.336074Z WARN lorax_router: router/src/main.rs:356: --revision is not set
2023-12-31T07:11:18.336086Z WARN lorax_router: router/src/main.rs:357: We strongly advise to set it to a known supported commit.
2023-12-31T07:11:18.447140Z INFO lorax_router: router/src/main.rs:378: Serving revision 9afb6f0a7d7fe9ecebdda1baa4ff4e13e73e97d7 of model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
2023-12-31T07:11:18.466622Z INFO lorax_router: router/src/main.rs:216: Warming up model
2023-12-31T07:11:18.498551Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 864, in warmup
_, batch = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 963, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 960, in generate_token
out = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 408, in forward
logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 978, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 911, in forward
hidden_states = self.embed_tokens(input_ids)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 609, in forward
out = torch.nn.functional.embedding(input, self.weight)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacty of 11.76 GiB of which 4.19 MiB is free. Process 12234 has 11.74 GiB memory in use. Of the allocated memory 11.52 GiB is allocated by PyTorch, and 13.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in
File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept return await response File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor raise error File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 74, in Warmup max_supported_total_tokens = self.model.warmup(batch) File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 867, in warmup raise RuntimeError( RuntimeError: Not enough memory to handle 512 prefill tokens. You need to decrease
--max-batch-prefill-tokens
2023-12-31T07:11:18.498837Z ERROR warmup{max_input_length=512 max_prefill_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 512 prefill tokens. You need to decrease --max-batch-prefill-tokens
2023-12-31T07:12:19.522290Z ERROR warmup{max_input_length=512 max_prefill_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-12-31T07:12:19.581935Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=60000) ran for 60334 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=60000) ran for 60334 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=60000) ran for 60334 milliseconds before timing out. rank=0
2023-12-31T07:12:19.581956Z ERROR shard-manager: lorax_launcher: Shard process was signaled to shutdown with signal 6 rank=0
2023-12-31T07:12:19.590594Z ERROR lorax_launcher: Shard 0 crashed
2023-12-31T07:12:19.590617Z INFO lorax_launcher: Terminating webserver
2023-12-31T07:12:19.590628Z INFO lorax_launcher: Waiting for webserver to gracefully shutdown
2023-12-31T07:12:19.590652Z INFO lorax_launcher: webserver terminated
2023-12-31T07:12:19.590659Z INFO lorax_launcher: Shutting down shards
2023-12-31T07:12:19.936191Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
Error: ShardFailed
This is the output from nvidia-smi:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   51C    P2              38W / 170W |  12040MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:06:00.0 Off |                  N/A |
|  0%   61C    P2             152W / 350W |  12251MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2306      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A     12234      C   /opt/conda/bin/python3.10                 12026MiB |
|    1   N/A  N/A      2306      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A     12233      C   /opt/conda/bin/python3.10                 12234MiB |
+---------------------------------------------------------------------------------------+
```
It appears as if the container is trying to split the model evenly over both GPUs, filling up the 3060 while the 3090 still has a lot of space left over. Is there a way to change how the layers are split so that the 3090 takes the larger chunk? Which part of LoRAX is responsible for sharding?
Hey @psych0v0yager, that's an interesting scenario. It might be a little tricky (though not impossible) to divide the weights differently across the GPUs.
LoRAX uses tensor parallelism, so we slice tensors along dimensions when loading them and then aggregate the results of computations at certain points during the forward pass. To make this work the way you're describing, we would need a way to chunk the tensors more granularly, and then assign different workers a different number of chunks based on how much GPU memory they have available.
Here are the various tensor parallel layer implementations: https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/layers.py
And here you can see the logic that shards the weights: https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/weights.py#L353
These would roughly be the sections of the code that would need to change for this.
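To make the even-split behavior concrete, here is a toy, single-process sketch (not the actual LoRAX code) of how a tensor-parallel linear layer slices a weight and then aggregates the partial results; all names and sizes below are made up for illustration:

```python
# Toy illustration of even tensor parallelism (single process, not LoRAX code):
# each shard holds an equal slice of the weight, computes a partial matmul,
# and the partial outputs are concatenated to recover the full result.
import torch

world_size = 2                               # e.g. two GPUs
hidden, out_features = 8, 16
weight = torch.randn(out_features, hidden)

# Even split: each shard gets out_features // world_size rows of the weight.
shards = torch.chunk(weight, world_size, dim=0)

x = torch.randn(4, hidden)                   # a batch of hidden states
partials = [x @ shard.t() for shard in shards]   # each shard's local computation
full = torch.cat(partials, dim=-1)               # aggregation step

assert torch.allclose(full, x @ weight.t(), atol=1e-5)
```

Supporting differently sized GPUs would roughly mean making those per-shard chunk sizes (and the matching slicing and aggregation bookkeeping in the files linked above) vary per rank instead of always being an even split.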
Thank you @tgaddair for the reply. I will look at those sections; it does seem a bit tricky.
Meanwhile, I was looking at the following code:
https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/dist.py
I was wondering if it would be simpler to keep the existing tensor parallelism and instead shard the model into 3 slices, putting 2 slices on the 3090 and one slice on the 3060. That way none of the tensor parallelism would need to be rewritten.
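To make the idea concrete, here is a rough single-process sketch of what I mean (purely illustrative, not LoRAX code): keep the even slicing, just use three slices and place two of them on the 3090 and one on the 3060.

```python
# Rough sketch of the idea (illustrative only): keep an even 3-way tensor split,
# but assign two of the three slices to the 3090 and one to the 3060.
import torch

num_slices = 3
out_features, hidden = 18, 8
weight = torch.randn(out_features, hidden)

# Same even split as today, just into three slices instead of two.
slices = torch.chunk(weight, num_slices, dim=0)           # three (6, 8) tensors

# Hypothetical placement map: slice index -> GPU. Everything stays on CPU here
# just to show the math still works out with this assignment.
placement = {0: "3090", 1: "3090", 2: "3060"}
for i, s in enumerate(slices):
    print(f"slice {i} -> {placement[i]}, shape {tuple(s.shape)}")

x = torch.randn(4, hidden)
partials = [x @ s.t() for s in slices]                    # per-slice local matmuls
full = torch.cat(partials, dim=-1)                        # same aggregation as before

assert torch.allclose(full, x @ weight.t(), atol=1e-5)
```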
If I wanted to implement this, what portions of the code would I need to modify?