Jiatong (Julius) Han
Try `python3 main.py --logdir /tmp -t --postfix test -b configs/train_colossalai_cifar10.yaml --placement_policy cuda`. It was likely due to a previous version defaulting to auto placement, which often introduced tensor device...
How did you set `--nproc_per_node=gpu`? I cannot see where `gpu` is defined, and it is supposed to be a number that does not exceed 2. Other than that, I...
Please stick to an even `nproc_per_node` for now (or set it to `1`). The reason is that the temporal dimension of the DiT attention block is `16`, which is not divisible...
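A quick way to sanity-check a process count before launching is to test divisibility directly. This is only a sketch: it assumes the constraint is that the temporal dimension of `16` quoted above must split evenly across processes, which is stricter than "even". `valid_proc_counts` is a hypothetical helper, not part of the project.

```python
from typing import List

TEMPORAL_DIM = 16  # value quoted in the comment above


def valid_proc_counts(dim: int, max_procs: int) -> List[int]:
    """Return the process counts in [1, max_procs] that divide dim evenly."""
    return [n for n in range(1, max_procs + 1) if dim % n == 0]


# Assumption: up to 8 processes are available on the node.
print(valid_proc_counts(TEMPORAL_DIM, 8))
```

Any count that is not in the printed list would leave the dimension unevenly split under this assumption.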
Can you use a command such as `ping -c 4 8.8.8.8` to check whether your internet connection is available?
It should only be `None` after `optimizer.zero_grad()`; `booster.backward` was doing `torch.optim.Optimizer.backward(loss)`. Would you mind printing the contents of `loss` to see if it is `NaN`?
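For the `NaN` check, something along these lines may help. It is a minimal sketch that assumes the loss is a single scalar; with a PyTorch tensor you would pass `loss.item()`. `loss_is_finite` is a hypothetical helper name.

```python
import math


def loss_is_finite(loss_value: float) -> bool:
    """Return True if the scalar loss is neither NaN nor infinite."""
    return math.isfinite(loss_value)


# Assumption: in a training loop this would be called as
# loss_is_finite(loss.item()) right before the backward pass.
for value in (0.5, float("nan"), float("inf")):
    print(value, loss_is_finite(value))
```

If the check fails, the gradients are unlikely to be meaningful, so it is worth looking at the inputs and the learning rate before the backward pass itself.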
Thanks for sharing your solution. And for cross-referencing, this issue was similar to issue #258.
Can you `pip install --upgrade flash-attn --no-build-isolation`?
I am gonna close this issue since it appears to have been resolved by the question owner.
#550
Yes, you may use Docker to build a training or inference environment. On Windows, you might want to use WSL to work with Docker.