aitextgen
AWS SageMaker Multiple GPU Training Fails
Hello,
Running aitextgen fine-tuning gives the following error:
AttributeError: Can't pickle local object 'get_linear_schedule_with_warmup.<locals>.lr_lambda'
Running on an ml.p3.8xlarge instance. I believe this is something related to the DDP settings from pytorch_lightning.
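For context, spawn-based DDP has to pickle the model and everything attached to it, and a function defined inside another function cannot be pickled. A minimal sketch reproduces the same failure (`get_scheduler` here is a hypothetical stand-in for `get_linear_schedule_with_warmup`):

```python
import pickle

def get_scheduler():
    # mimics how get_linear_schedule_with_warmup builds lr_lambda as a closure
    def lr_lambda(step):
        return max(0.0, 1.0 - step / 1000)
    return lr_lambda

fn = get_scheduler()
try:
    pickle.dumps(fn)  # spawn-based DDP needs this to succeed
except AttributeError as err:
    # Can't pickle local object 'get_scheduler.<locals>.lr_lambda'
    print(err)
```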
I believe this is something related to ddp settings from pytorch_lightning.
Probably a good guess.
I'm less familiar with how pytorch-lightning works in this case, i.e., it may not be a problem on my end. Although maybe adding an option to disable/specify scheduling could work.
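One picklable shape for such a scheduler (a sketch, not aitextgen's actual code; `linear_warmup_lambda`, `warmup_steps`, and `total_steps` are illustrative names): a module-level function wrapped in `functools.partial` pickles cleanly, unlike a nested closure, and so would survive spawn-based DDP:

```python
import pickle
from functools import partial

def linear_warmup_lambda(step, warmup_steps, total_steps):
    # same schedule shape as a linear warmup/decay, but defined at top level
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# partial objects over top-level functions are picklable
lr_lambda = partial(linear_warmup_lambda, warmup_steps=100, total_steps=1000)
restored = pickle.loads(pickle.dumps(lr_lambda))
print(restored(50))  # 0.5 halfway through warmup
```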
Actually, it would be great if there were a setting to change this part of the code, which is at aitextgen.py line 748:
if n_gpu > 1: train_params["distributed_backend"] = "ddp"
Instead of automatically setting it to 'ddp', it would be helpful to have options to choose from, such as 'ddp_spawn' or 'dp', to troubleshoot.
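As a sketch of that proposal (hypothetical; `build_train_params` and its `accelerator` parameter are not part of aitextgen's API), the hard-coded assignment could fall back to "ddp" only when the caller does not choose a backend:

```python
def build_train_params(n_gpu, accelerator=None):
    # hypothetical helper: lets the caller override the distributed backend
    train_params = {}
    if n_gpu > 1:
        # keep "ddp" as the default, but allow "ddp_spawn", "dp", etc.
        train_params["distributed_backend"] = accelerator or "ddp"
    return train_params

print(build_train_params(4, accelerator="ddp_spawn"))
```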
Other forms of parallelism are explained in the PyTorch Lightning docs as follows:
Distributed modes: Lightning allows multiple ways of training
- Data Parallel (accelerator='dp') (multiple GPUs, 1 machine)
- DistributedDataParallel (accelerator='ddp') (multiple GPUs across many machines, python script based)
- DistributedDataParallel (accelerator='ddp_spawn') (multiple GPUs across many machines, spawn based)
- DistributedDataParallel 2 (accelerator='ddp2') (DP in a machine, DDP across machines)
- Horovod (accelerator='horovod') (multi-machine, multi-GPU, configured at runtime)
- TPUs (tpu_cores=8|x) (TPU or TPU pod)
What do you think?
We basically can't use aitextgen to train larger models until this is fixed, because they don't fit onto a single GPU.
Has this not been addressed yet / is there no workaround?
Hello there. I had the same issue as well. I was trying to train the 1558M GPT-2 model but always got that error when loading. Further up in the error message it says that the ddp_spawn strategy was chosen by default. I worked around that by changing the "fallback strategy" to ddp. All you need to do is edit the pytorch_lightning module (~/.local/lib/python3.x/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py) at line 853 (self.distributed_backend = DistributedType.DDP_SPAWN) and pick a distribution type that works better. I am not sure if that is a "good" solution, but at least it is some sort of workaround. Hope that helps.
Kaggle never has this issue.