ghostplant

272 comments by ghostplant

You can run the docker command with `docker run -e LOCAL_SIZE=4 -it --rm --net=host ..` to reduce the GPU count. Setting `LOCAL_SIZE=2` should also work for A100 (80G) x 2. However,...

Got it. Please re-pull the image to skip the downloading procedure:

```sh
docker pull tutelgroup/deepseek-671b:a100x8-chat-20250723
```

Yes.. it prints the question prompts as well for now.

@squirrelfish The next image version has removed the prefill strings from the response: ```sh docker run -e LOCAL_SIZE=8 -it --rm --ipc=host --net=host --shm-size=8g \ --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -v /:/host...

May I know if "Nanotron" is still active? I tried deploying it for Tutel integration, but Nanotron fails even under a uv environment. Is there any docker environment that is...

Thank you. This command doesn't seem to run into issues: `nsys profile --trace-fork-before-exec=true -o tutel_fail.nsys python3 -m torch.distributed.run --nproc-per-node=2 -m tutel.examples.helloworld` Tutel's initialization still uses torch's naive distributed initialization, but...

Hello, Megatron-LM already includes a non-dynamic component that supports several MoE functionalities. However, since Megatron's expert parameter placement is static, and coupled with a set of Megatron’s predefined static parallelism...

"only admin", really? Currently there is no user, so you should be able to bypass the login phase directly.

It should be compatible with Blackwell, but the strategies are not optimized for it. How many machines do you use across nodes? And how many local GPUs are there per machine?

The docker instance has not been tested on Blackwell for a few months. Will dive into this in a day.