Thorin Farnsworth

30 comments of Thorin Farnsworth

And have you tried this without `mpiexec -n 8`? Just `python cm_train.py ...` etc.?

What happens if you do `dist.get_world_size()`? How many GPUs does the machine you are using have? You can also check this with `nvidia-smi` in a terminal.
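A minimal sketch of the kind of check I mean (assuming it runs under the same launch command as `cm_train.py`, so the process group is already initialised; otherwise `get_world_size()` has nothing to report):

```python
# Quick sanity check of the distributed setup vs. the visible hardware.
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:  ", torch.cuda.device_count())
if dist.is_available() and dist.is_initialized():
    print("World size:    ", dist.get_world_size())
else:
    print("torch.distributed is not initialised in this process")
```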

https://github.com/openai/consistency_models/blob/6d26080c58244555c031dbc63080c0961af74200/cm/train_util.py#L98 Here in the training loop is where distributed training is selected. It seems to be activated whenever CUDA is available, rather than when CUDA is available *and* there are multiple GPUs. I am unsure whether [DDP](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)...
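For reference, this is roughly the guard I have in mind; a sketch rather than the repo's exact code, and the device handling here is my own assumption:

```python
# Sketch: only wrap in DDP when CUDA is available AND there is more than one process.
import torch as th
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def maybe_wrap_ddp(model):
    multi_process = (
        dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1
    )
    if th.cuda.is_available() and multi_process:
        model = model.to("cuda")
        return DDP(model, device_ids=[th.cuda.current_device()])
    # Single-GPU / CPU fallback: just use the plain model.
    return model.to("cuda" if th.cuda.is_available() else "cpu")
```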

You need to change the `if th.cuda.is_available():` check; here you are only changing the attribute, and DDP is still used. You could try changing the [backend on DDP to not use...
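If NCCL is the problem, the usual alternative is the Gloo backend when initialising the process group. A sketch, assuming the standard `torch.distributed` environment variables rather than however this repo's `dist_util` actually sets things up:

```python
# Sketch: initialise the process group with Gloo instead of NCCL.
# Gloo is the common fallback when NCCL is unavailable or misbehaving.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",  # instead of "nccl"
    rank=int(os.environ.get("RANK", 0)),
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
)
```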

Image sampling doesn't require a dataset, from what I can see. It would be odd for it to be required.

If you look at the code you sent above: in the `else` branch, `ddp_model` is set to the original model, which I think is what you want. The...

> Simply deleting `mpiexec -n 8` will run the code on one single GPU, as I mentioned in issue #20. I think the issue is that without NCCL...

`mpiexec` or `mpirun` are pretty simple, I would definitely recommend learning to use them or trying to use the other launchers I mentioned above. You basically need to have something...
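For what it's worth, this is the kind of thing the launcher gives each process, assuming `mpi4py` is installed (a minimal standalone sketch, not this repo's own setup; the filename is just for illustration):

```python
# check_mpi.py -- each process launched by mpiexec/mpirun gets its own rank.
# Run with e.g. `mpiexec -n 2 python check_mpi.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()} processes")
```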

There are some differences from the paper in general. For example, the rho scheduling is reversed in the paper; the formula used in the code here is more similar to...
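To illustrate what I mean by reversed, here is a sketch with commonly used default values assumed (`sigma_min = 0.002`, `sigma_max = 80`, `rho = 7`), not copied from either the paper or the code:

```python
# Sketch: the EDM-style schedule runs sigma_max -> sigma_min, while the paper's
# t_i runs from epsilon up to T -- the same values, in the opposite order.
import numpy as np

def edm_style_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    i = np.arange(n)
    return (sigma_max ** (1 / rho)
            + i / (n - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

def paper_style_timesteps(n, eps=0.002, T=80.0, rho=7.0):
    i = np.arange(1, n + 1)
    return (eps ** (1 / rho)
            + (i - 1) / (n - 1) * (T ** (1 / rho) - eps ** (1 / rho))) ** rho

n = 10
assert np.allclose(edm_style_sigmas(n), paper_style_timesteps(n)[::-1])
```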

There are quite a few differences. I've raised an additional ticket and emailed Yang Song to hopefully find out which is better: https://github.com/openai/consistency_models/issues/18 I would need to check, but the scaling...