Second GPU is not found when running --sharded true
System Info
Lorax version: 0.4.1
Lorax launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUs: RTX 3090 (24 GB), RTX 3060 (12 GB)
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
model=mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data
sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize bitsandbytes-nf4 --max-batch-prefill-tokens 2048 --sharded true
Error Message:
2023-12-24T07:02:10.759386Z INFO lorax_launcher: Parsing num_shard from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
Error: NotEnoughCUDADevices("sharded is true but only found 1 CUDA devices")
Expected behavior
The expected behavior is for LoRAX to find both GPUs. For reference, here is the output of nvidia-smi:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   49C    P8              15W / 170W |      9MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:06:00.0 Off |                  N/A |
|  0%   51C    P8              18W / 350W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
```
I checked the documentation and it says that --sharded true is the default setting of the server; however, when I do not pass --sharded true, I get an out-of-memory error and need to use a much smaller --max-batch-prefill-tokens (1024, to be exact). When I run nvidia-smi, I get the following output:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   44C    P8              15W / 170W |     12MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:06:00.0 Off |                  N/A |
| 81%   57C    P2             114W / 350W |  23873MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      2249      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      7439      C   /opt/conda/bin/python3.10                 23856MiB |
+---------------------------------------------------------------------------------------+
```
It appears as if the server cannot find the 3060. I swapped the 3060 for one of my other GPUs (a Tesla P100, 16 GB), yet I still received the same error.
Hey @psych0v0yager, apologies for the late reply, I've been out on holiday.
My first suspicion is that PyTorch isn't able to discover the device for some reason. Can you try running the following and sharing the output:
python -c "import torch; print(torch.cuda.device_count())"
No worries! I ran the command in my conda environment and received the following output.
python -c "import torch; print(torch.cuda.device_count())"
2
Was that command run from within the lorax Docker container or outside of it? If you ran it outside the container, it would be worth testing it from within the container (by running docker exec -it <container_id> /bin/bash to open a shell inside it) to see if it gives different results.
Another thing you can try is setting --num_shard 2 explicitly. If it's unable to find the second GPU with that arg, it should hopefully raise a more useful error.
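If it helps narrow things down, something like the following should also print each device's name and memory; this is just standard PyTorch calls, nothing LoRAX-specific, so it should behave the same inside and outside the container:

```python
# List every CUDA device PyTorch can see, along with its name and total memory.
import torch

print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i}  {props.name}  {props.total_memory / 1024**3:.1f} GiB")
```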
Okay, here are my results.
Running the command inside the Docker container gave the same result as outside: it detected 2 devices.
Furthermore, --num-shard worked as well; the model was sharded across the two GPUs. Here is the exact command I ran:
model=TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
volume=$PWD/data
sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize awq --max-batch-prefill-tokens 512 --max-input-length 512 --num-shard 2
However, the container errored out with the following message:
2023-12-31T07:11:04.818168Z INFO lorax_launcher: Args { model_id: "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: Some(Awq), dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 512, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 512, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "7875655a60f8", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2023-12-31T07:11:04.818186Z WARN lorax_launcher: trust_remote_code is set. Trusting that model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ do not contain malicious code.
2023-12-31T07:11:04.818189Z INFO lorax_launcher: Sharding model on 2 processes
2023-12-31T07:11:04.818250Z INFO download: lorax_launcher: Starting download process.
2023-12-31T07:11:07.026335Z INFO lorax_launcher: cli.py:103 Files are already present on the host. Skipping download.
2023-12-31T07:11:07.320835Z INFO download: lorax_launcher: Successfully downloaded weights.
2023-12-31T07:11:07.320990Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2023-12-31T07:11:07.321025Z INFO shard-manager: lorax_launcher: Starting shard rank=1
2023-12-31T07:11:17.329108Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=1
2023-12-31T07:11:17.329108Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2023-12-31T07:11:17.607159Z INFO lorax_launcher: server.py:269 Server started at unix:///tmp/lorax-server-0
2023-12-31T07:11:17.629335Z INFO shard-manager: lorax_launcher: Shard ready in 10.307929251s rank=0
2023-12-31T07:11:17.706473Z INFO lorax_launcher: server.py:269 Server started at unix:///tmp/lorax-server-1
2023-12-31T07:11:17.729408Z INFO shard-manager: lorax_launcher: Shard ready in 10.407867503s rank=1
2023-12-31T07:11:17.828920Z INFO lorax_launcher: Starting Webserver
2023-12-31T07:11:18.336074Z WARN lorax_router: router/src/main.rs:356: --revision is not set
2023-12-31T07:11:18.336086Z WARN lorax_router: router/src/main.rs:357: We strongly advise to set it to a known supported commit.
2023-12-31T07:11:18.447140Z INFO lorax_router: router/src/main.rs:378: Serving revision 9afb6f0a7d7fe9ecebdda1baa4ff4e13e73e97d7 of model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
2023-12-31T07:11:18.466622Z INFO lorax_router: router/src/main.rs:216: Warming up model
2023-12-31T07:11:18.498551Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 864, in warmup
_, batch = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 963, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 960, in generate_token
out = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 408, in forward
logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 978, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 911, in forward
hidden_states = self.embed_tokens(input_ids)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 609, in forward
out = torch.nn.functional.embedding(input, self.weight)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacty of 11.76 GiB of which 4.19 MiB is free. Process 12234 has 11.74 GiB memory in use. Of the allocated memory 11.52 GiB is allocated by PyTorch, and 13.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in
File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept return await response File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor raise error File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 74, in Warmup max_supported_total_tokens = self.model.warmup(batch) File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 867, in warmup raise RuntimeError( RuntimeError: Not enough memory to handle 512 prefill tokens. You need to decrease
--max-batch-prefill-tokens
2023-12-31T07:11:18.498837Z ERROR warmup{max_input_length=512 max_prefill_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 512 prefill tokens. You need to decrease --max-batch-prefill-tokens
2023-12-31T07:12:19.522290Z ERROR warmup{max_input_length=512 max_prefill_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-12-31T07:12:19.581935Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=60000) ran for 60334 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=60000) ran for 60334 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=2097152, NumelOut=2097152, Timeout(ms)=60000) ran for 60334 milliseconds before timing out. rank=0
2023-12-31T07:12:19.581956Z ERROR shard-manager: lorax_launcher: Shard process was signaled to shutdown with signal 6 rank=0
2023-12-31T07:12:19.590594Z ERROR lorax_launcher: Shard 0 crashed
2023-12-31T07:12:19.590617Z INFO lorax_launcher: Terminating webserver
2023-12-31T07:12:19.590628Z INFO lorax_launcher: Waiting for webserver to gracefully shutdown
2023-12-31T07:12:19.590652Z INFO lorax_launcher: webserver terminated
2023-12-31T07:12:19.590659Z INFO lorax_launcher: Shutting down shards
2023-12-31T07:12:19.936191Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
Error: ShardFailed
This is the output from nvidia-smi:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   51C    P2              38W / 170W |  12040MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:06:00.0 Off |                  N/A |
|  0%   61C    P2             152W / 350W |  12251MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2306      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A     12234      C   /opt/conda/bin/python3.10                 12026MiB |
|    1   N/A  N/A      2306      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A     12233      C   /opt/conda/bin/python3.10                 12234MiB |
+---------------------------------------------------------------------------------------+
```
It appears as if the container is trying to split the model evenly over both GPUs, filling up the 3060 while the 3090 still has a lot of space left over. Is there a way to change how the layers are split so that the 3090 takes the larger chunk? Which part of LoRAX is responsible for sharding?
Hey @psych0v0yager, that's an interesting scenario. It might be a little tricky (though not impossible) to divide the weights differently across the GPUs.
LoRAX uses tensor parallelism, so we slice tensors along dimensions when loading them and then aggregate the results of computations at certain points during the forward pass. To make this work the way you're describing, we would need a way to chunk the tensors more granularly, and then assign different workers a different number of chunks based on how much GPU memory they have available.
Here are the various tensor parallel layer implementations: https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/layers.py
And here you can see the logic that shards the weights: https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/weights.py#L353
These would roughly be the sections of the code that would need to change for this.
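To make the even-split behavior concrete, here is a toy, single-process sketch (not the actual LoRAX code) of how a tensor-parallel linear layer slices a weight and then aggregates the partial results; all names and sizes below are made up for illustration:

```python
# Toy illustration of even tensor parallelism (single process, not LoRAX code):
# each shard holds an equal slice of the weight, computes a partial matmul,
# and the partial outputs are concatenated to recover the full result.
import torch

world_size = 2                               # e.g. two GPUs
hidden, out_features = 8, 16
weight = torch.randn(out_features, hidden)

# Even split: each shard gets out_features // world_size rows of the weight.
shards = torch.chunk(weight, world_size, dim=0)

x = torch.randn(4, hidden)                   # a batch of hidden states
partials = [x @ shard.t() for shard in shards]   # each shard's local computation
full = torch.cat(partials, dim=-1)               # aggregation step

assert torch.allclose(full, x @ weight.t(), atol=1e-5)
```

Supporting differently sized GPUs would roughly mean making those per-shard chunk sizes (and the matching slicing and aggregation bookkeeping in the files linked above) vary per rank instead of always being an even split.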
Thank you @tgaddair for the reply. I will look at those sections; it does seem a bit tricky.
Meanwhile, I was looking at the following code:
https://github.com/predibase/lorax/blob/main/server/lorax_server/utils/dist.py
I was wondering if it would be simpler to keep the existing tensor parallelism and instead shard the model into 3 slices, putting 2 slices on the 3090 and one slice on the 3060. That way none of the tensor parallelism would need to be rewritten.
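To make the idea concrete, here is a rough single-process sketch of what I mean (purely illustrative, not LoRAX code): keep the even slicing, just use three slices and place two of them on the 3090 and one on the 3060.

```python
# Rough sketch of the idea (illustrative only): keep an even 3-way tensor split,
# but assign two of the three slices to the 3090 and one to the 3060.
import torch

num_slices = 3
out_features, hidden = 18, 8
weight = torch.randn(out_features, hidden)

# Same even split as today, just into three slices instead of two.
slices = torch.chunk(weight, num_slices, dim=0)           # three (6, 8) tensors

# Hypothetical placement map: slice index -> GPU. Everything stays on CPU here
# just to show the math still works out with this assignment.
placement = {0: "3090", 1: "3090", 2: "3060"}
for i, s in enumerate(slices):
    print(f"slice {i} -> {placement[i]}, shape {tuple(s.shape)}")

x = torch.randn(4, hidden)
partials = [x @ s.t() for s in slices]                    # per-slice local matmuls
full = torch.cat(partials, dim=-1)                        # same aggregation as before

assert torch.allclose(full, x @ weight.t(), atol=1e-5)
```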
If I wanted to implement this, what portions of the code would I need to modify?