Bihan Rana comments

Results: 34 comments of Bihan Rana

Only 8 of 64 GPUs Are Fully Partitioned and Usable in Docker After CPX/NPS4

Thank you, @maxweiss. I will try with `ROCm 6.4.0`.

Only 8 of 64 GPUs Are Fully Partitioned and Usable in Docker After CPX/NPS4

@maxweiss You are right: with ROCm 6.4.0, every PARTITION entry shows a valid value, but only the devices at indices 0, 8, 16, 24, 32, 40, 48, and 56 are attachable via Docker’s...

Only 8 of 64 GPUs Are Fully Partitioned and Usable in Docker After CPX/NPS4

@maxweiss Once again, thank you. Yes, this looks like just a display error. I ran vLLM inference and it worked too. I also tried with `ROCm 6.4.1` and it worked...

[Installation]: VLLM on ARM machine with GH200

To successfully run vLLM on the GH200, we followed these steps:

```
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
# Inside the container
$ pip3 install --pre torch torchvision...
```

[Examples] Improve the AMD example

@peterschmidt85

1. For inference examples, we do not need to build manually. (I will update.)
2. For fine-tuning, I will check and let you know.
3. Yes, I will update...

[AMD] Support default Docker image for AMD

> Currently, we require the user to specify `image` always when using AMD. It would be cool if we provide a small and up-to-date AMD image with ROCm drivers.

@peterschmidt85...

[Bug]: Service re-run terminates despite available fleet capacity.

[gateway_logs.txt](https://github.com/user-attachments/files/24261039/gateway_logs.txt)

@jvstme Here are the gateway logs from around that time.

[Bug]: Service re-run terminates despite available fleet capacity.

> These are the gateway logs about replica `23d3e9` that the server failed to register:
>
> ```
> Dec 19 13:03:19 ip-172-31-21-247 sh[29500]: 2025-12-19 13:03:19,068 - dstack._internal.proxy.gateway.services.registry - DEBUG...
> ```

Add replica groups in dstack-service

I will resolve merge conflicts as the review continues.

Add replica groups in dstack-service

## Related PRs

https://github.com/dstackai/dstack/pull/3205 from @DragonStuff