Jiatong (Julius) Han comments

Results: 216 comments of Jiatong (Julius) Han

Problem with stable diffusion training

Try `python3 main.py --logdir /tmp -t --postfix test -b configs/train_colossalai_cifar10.yaml --placement_policy cuda`. The problem was likely due to a previous version defaulting to auto placement, which often introduced tensor device...

Problem when running inference with two GPUs

How did you set `--nproc_per_node=gpu`? I cannot see where `gpu` is defined, and it is supposed to be a number that does not exceed 2. Other than that, I...

AssertionError while inferencing with multiple gpus

Please stick to an even `nproc_per_node` for now (or set it to `1`). The reason is that the temporal dimension of the DiT attention block is `16`, which is not divisible...
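The divisibility constraint mentioned above can be checked before launching. The following is a minimal sketch, assuming the requirement is simply that the temporal dimension splits evenly across processes (the comment is truncated, so the exact rule is an assumption). The function name `valid_process_counts` and the constant `TEMPORAL_DIM` are hypothetical and not part of any library.

```python
# Hypothetical helper: lists process counts that divide the temporal dimension.
# Assumption: the temporal dimension must be evenly divisible by nproc_per_node.

TEMPORAL_DIM = 16  # value quoted in the comment above


def valid_process_counts(temporal_dim: int, max_procs: int) -> list[int]:
    """Return every process count in 1..max_procs that divides temporal_dim."""
    return [n for n in range(1, max_procs + 1) if temporal_dim % n == 0]


if __name__ == "__main__":
    # With 16 and up to 8 processes this prints [1, 2, 4, 8].
    print(valid_process_counts(TEMPORAL_DIM, 8))
```

A value such as `3` is absent from the result, which matches the assertion error reported for odd process counts.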

inference sample issue: File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection sock.connect(sa) TimeoutError: timed out

Can you use a command such as `ping -c 4 8.8.8.8` to check whether your internet connection is available?
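Since the traceback comes from a TCP connection attempt, the same check can also be made from Python. This is a rough sketch using only the standard library. The function name `can_connect` is hypothetical, and the choice of `8.8.8.8` on port `53` assumes that host accepts TCP connections on the DNS port from your network.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed target: a public DNS server reachable over TCP port 53.
    print("network reachable:", can_connect("8.8.8.8", 53))
```

If this prints `False` while `ping` succeeds, a proxy or firewall blocking outbound TCP is a likely cause worth checking.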

the gradient of all parameters is None

It should only be `None` after `optimizer.zero_grad()`; `booster.backward` was doing `torch.optim.Optimizer.backward(loss)`. Would you mind printing the contents of `loss` to see if it is `NaN`?

Cannot install apex

Thanks for sharing your solution. For cross-reference, this issue is similar to issue #258.

errors happened in the inference process

Can you `pip install --upgrade flash-attn --no-build-isolation`?

Out-of-memory for default config.

I am gonna close this issue since it appears to have been resolved by the question owner.

bad inference result

#550

question: docker

Yes, you may use Docker to build a training or inference environment. On Windows, you might want to use WSL to run Docker.