botbw
> @botbw I have written a min repro with a simple network and in this case the keys actually match! I will take a closer look at my code and...
Hey folks, I'm closing this issue since it's been stalled for a while. Feel free to reopen it or submit a new issue if you still have any questions.
@mahaocong90 Thanks for reporting this! I tested the script with 24xH20 GPUs (2304 GiB mem in total), and the script works fine on my side (at least for dozens of...
Hi @happynaruto, would it be possible to provide the script you ran? Assuming that you are doing some fine-tuning, I tested using `examples/language/llama/benchmark.py` by adding config ```python "qwen": Qwen2Config( hidden_act="silu",...
@Gautam-Rajeev I think master node env vars will be synced to the rest: https://github.com/hpcaitech/ColossalAI/blob/6d676ee0e95d54df90b4ee640dee0e0a198ab8f3/colossalai/cli/launcher/run.py#L280-L287 https://github.com/hpcaitech/ColossalAI/blob/6d676ee0e95d54df90b4ee640dee0e0a198ab8f3/colossalai/cli/launcher/multinode_runner.py#L47-L53 You might want to change the code a bit to allow different NCCL_SOCKET_IFNAME, or simply...
Hey @fangxintao, it looks like something might be wrong with the padding. Would it be possible to provide a minimal reproduction or the entire traceback?
Hi @fincherjc, I also spotted this issue when running MLPerf Storage. After applying your patch to commit `v2.0`, I can find `num_files_train` in `metadata.json`, whereas the `run` command still uses...
> There may be a valid idea for improvement to have training run detect and execute on the full dataset, or to autoscale num_files based on the number of accelerators...
@pbelevich Thanks, I believe this is a correct fix for bullet 1, but could you explain where the hang happened and why this resolves it?