Jiatong (Julius) Han
No. This method should be able to gather optimiser states before saving.
It seems that one of the workers failed:
Hello, the actual reason is that when using the [dummy dataset](https://github.com/hpcaitech/ColossalAI/blob/a020eecc7051083e1dbc4a02bd49a9521b032aad/examples/language/gpt/titans/dataset/webtext.py#L35), data is generated randomly, so it does not make sense to use multiple workers to load data from anywhere. Multiple workers...
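For illustration, here is a minimal sketch of why extra loader workers buy nothing in this case (the class name and shapes below are hypothetical, not the exact webtext.py code): the samples are produced on the fly, so there is no disk I/O for worker processes to overlap.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomTokenDataset(Dataset):  # hypothetical stand-in for the dummy dataset
    def __init__(self, num_samples=1024, seq_len=1024, vocab_size=50257):
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Data is generated randomly on the fly; nothing is read from disk.
        input_ids = torch.randint(0, self.vocab_size, (self.seq_len,))
        return {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids)}

# num_workers=0: spawning extra loader processes only adds overhead here.
loader = DataLoader(RandomTokenDataset(), batch_size=4, num_workers=0)
```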
Through some preliminary checks, it has to do with line 354: `model = GeminiDDP(model, device=get_current_device(), placement_policy=PLACEMENT_POLICY, pin_memory=True)`.
Exit code (-9) strongly suggests that your CPU ran out of memory. Try allocating more main memory to your trial and running again.
No. Import the utility below: `from colossalai.nn.parallel.utils import get_static_torch_model`, and add `model = get_static_torch_model(model)` before saving.
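A minimal sketch of the saving flow, assuming a GeminiDDP-wrapped model and a plain `torch.save` of the state dict (the function name and path below are placeholders):

```python
import torch
from colossalai.nn.parallel.utils import get_static_torch_model

def save_gemini_model(model, path="checkpoint.pt"):
    # `model` is the GeminiDDP-wrapped module; get_static_torch_model gathers
    # its parameters back into an ordinary torch.nn.Module first, so the usual
    # PyTorch saving API can be used.
    static_model = get_static_torch_model(model)
    torch.save(static_model.state_dict(), path)
```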
Were you using the ZeRO optimizer in your last run? The ZeroOptimizer class does not have a `state` attribute (it has `optim_state` instead), so there is a mismatch with the checkpoint.
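As a quick illustration (a hypothetical helper, not part of ColossalAI), you can pick the attribute the optimizer actually exposes so the checkpoint keys stay consistent:

```python
def get_optimizer_state(optimizer):
    # A vanilla torch.optim.Optimizer keeps per-parameter state in `.state`;
    # ZeroOptimizer keeps it in `.optim_state` instead.
    if hasattr(optimizer, "optim_state"):
        return optimizer.optim_state
    return optimizer.state
```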
Based on your second question, you should have found our checkpoint saving and loading utilities. [This line](https://github.com/hpcaitech/ColossalAI/blob/36a519b49f44a536d4ad9b1041ffc610c0aa1bba/colossalai/utils/checkpoint/module_checkpoint.py#L103) has already gathered the tensors and wiped out inter-device differences before we save at...
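For context, here is a minimal sketch of the gather-before-save pattern that line follows, written with plain `torch.distributed` as an illustration only (it is not the actual module_checkpoint.py implementation, and the concatenation axis is an assumption):

```python
import torch
import torch.distributed as dist

def save_gathered_on_rank0(sharded_tensor: torch.Tensor, path: str):
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(sharded_tensor) for _ in range(world_size)]
    # Collect every rank's shard so the saved tensor carries no inter-device differences.
    dist.all_gather(gathered, sharded_tensor)
    if dist.get_rank() == 0:
        # Assumes shards split along dim 0; adjust for the real sharding scheme.
        torch.save(torch.cat(gathered, dim=0), path)
```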
It can be due to a conflicting version of `subprocess32` installed via pip. Try the commands below:
```
pip uninstall subprocess32
conda install -c conda-forge subprocess32
```
#3146 should be a related issue. And we should solve these together.