Jiatong (Julius) Han
No. This method should be able to gather optimiser states before saving.
It seems that one of the workers failed:
Hello, the actual reason is that when using the [dummy dataset](https://github.com/hpcaitech/ColossalAI/blob/a020eecc7051083e1dbc4a02bd49a9521b032aad/examples/language/gpt/titans/dataset/webtext.py#L35), data is generated randomly, so it does not make sense to use multiple workers to load data from anywhere. Multiple workers...
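For illustration, here is a minimal sketch of why extra loader workers buy nothing in this case (the class name and shapes below are hypothetical, not the exact webtext.py code): the samples are produced on the fly, so there is no disk I/O for worker processes to overlap.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomTokenDataset(Dataset):  # hypothetical stand-in for the dummy dataset
    def __init__(self, num_samples=1024, seq_len=1024, vocab_size=50257):
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Data is generated randomly on the fly; nothing is read from disk.
        input_ids = torch.randint(0, self.vocab_size, (self.seq_len,))
        return {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids)}

# num_workers=0: spawning extra loader processes only adds overhead here.
loader = DataLoader(RandomTokenDataset(), batch_size=4, num_workers=0)
```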
Through some preliminary checks, it has to do with line 354: `model = GeminiDDP(model, device=get_current_device(), placement_policy=PLACEMENT_POLICY, pin_memory=True)`.
Exit code (-9) strongly suggests that your CPU ran out of memory. Try allocating more main memory to your trial and running again.
No. Import the utility below: `from colossalai.nn.parallel.utils import get_static_torch_model`, and add `model = get_static_torch_model(model)` before saving.
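A minimal sketch of the saving flow, assuming a GeminiDDP-wrapped model and a plain `torch.save` of the state dict (the function name and path below are placeholders):

```python
import torch
from colossalai.nn.parallel.utils import get_static_torch_model

def save_gemini_model(model, path="checkpoint.pt"):
    # `model` is the GeminiDDP-wrapped module; get_static_torch_model gathers
    # its parameters back into an ordinary torch.nn.Module first, so the usual
    # PyTorch saving API can be used.
    static_model = get_static_torch_model(model)
    torch.save(static_model.state_dict(), path)
```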
Were you using the ZeRO optimizer in your last run? The ZeroOptimizer class does not have a `state` attribute (it has `optim_state` instead), so there is a mismatch with the checkpoint.
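As a quick illustration (a hypothetical helper, not part of ColossalAI), you can pick the attribute the optimizer actually exposes so the checkpoint keys stay consistent:

```python
def get_optimizer_state(optimizer):
    # A vanilla torch.optim.Optimizer keeps per-parameter state in `.state`;
    # ZeroOptimizer keeps it in `.optim_state` instead.
    if hasattr(optimizer, "optim_state"):
        return optimizer.optim_state
    return optimizer.state
```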
Based on your second question, you should have found our checkpoint saving and loading utilities. [This line](https://github.com/hpcaitech/ColossalAI/blob/36a519b49f44a536d4ad9b1041ffc610c0aa1bba/colossalai/utils/checkpoint/module_checkpoint.py#L103) has already gathered the tensors and wiped out inter-device differences before we save at...
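For context, here is a minimal sketch of the gather-before-save pattern that line follows, written with plain `torch.distributed` as an illustration only (it is not the actual module_checkpoint.py implementation, and the concatenation axis is an assumption):

```python
import torch
import torch.distributed as dist

def save_gathered_on_rank0(sharded_tensor: torch.Tensor, path: str):
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(sharded_tensor) for _ in range(world_size)]
    # Collect every rank's shard so the saved tensor carries no inter-device differences.
    dist.all_gather(gathered, sharded_tensor)
    if dist.get_rank() == 0:
        # Assumes shards split along dim 0; adjust for the real sharding scheme.
        torch.save(torch.cat(gathered, dim=0), path)
```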
It can be due to a conflicting version of `subprocess32` installed via pip. Try the commands below:
```
pip uninstall subprocess32
conda install -c conda-forge subprocess32
```
#3146 should be a related issue. And we should solve these together.