
AWS SageMaker Multiple GPU Training Fails

Open cderinbogaz opened this issue 4 years ago • 6 comments

Hello,

Running aitextgen fine-tuning gives the following error:

AttributeError: Can't pickle local object 'get_linear_schedule_with_warmup.<locals>.lr_lambda'

Running on ml.p3.8xlarge instance, I believe this is something related to ddp settings from pytorch_lightning.
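For context, the error comes from multi-process training trying to pickle the LR scheduler: `ddp` launches fresh script processes, but `ddp_spawn` serializes objects with pickle to ship them to spawned workers, and a locally defined `lr_lambda` closure cannot be pickled. A minimal stdlib-only repro (the function body is an assumption modeled on the transformers implementation, not a verbatim copy):

```python
import pickle

def get_linear_schedule_with_warmup(num_warmup_steps, num_training_steps):
    # lr_lambda is a local function; the default pickler used by
    # spawn-based multiprocessing cannot serialize it.
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return current_step / max(1, num_warmup_steps)
        return max(0.0, (num_training_steps - current_step)
                   / max(1, num_training_steps - num_warmup_steps))
    return lr_lambda

fn = get_linear_schedule_with_warmup(100, 1000)
try:
    pickle.dumps(fn)
except AttributeError as e:
    print(e)  # Can't pickle local object '...<locals>.lr_lambda'
```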

cderinbogaz avatar May 31 '21 09:05 cderinbogaz

I believe this is something related to ddp settings from pytorch_lightning.

Probably a good guess.

I'm less familiar with how pytorch-lightning works in this case, i.e. may not be a problem on my end. Although maybe adding an option to disable/specify scheduling could work.
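One spawn-safe option (a sketch, not aitextgen's actual code) would be to replace the local closure with a module-level callable class; instances of such a class pickle cleanly, so `ddp_spawn` could serialize the scheduler:

```python
import pickle

class LinearWarmupLambda:
    """Hypothetical picklable replacement for the local lr_lambda closure."""
    def __init__(self, num_warmup_steps, num_training_steps):
        self.num_warmup_steps = num_warmup_steps
        self.num_training_steps = num_training_steps

    def __call__(self, current_step):
        if current_step < self.num_warmup_steps:
            return current_step / max(1, self.num_warmup_steps)
        return max(0.0, (self.num_training_steps - current_step)
                   / max(1, self.num_training_steps - self.num_warmup_steps))

# Round-trips through pickle without error, unlike the closure version.
restored = pickle.loads(pickle.dumps(LinearWarmupLambda(100, 1000)))
```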

minimaxir avatar Jun 02 '21 03:06 minimaxir

Actually, it would be great if there were a setting to change this part of the code, at aitextgen.py line 748:

if n_gpu > 1:
    train_params["distributed_backend"] = "ddp"

Instead of automatically setting it to 'ddp', it would be helpful to have options to choose from, such as 'ddp_spawn' or 'dp', to troubleshoot.
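The suggested change could look something like this sketch (build_train_params is a hypothetical stand-in for the surrounding code in aitextgen.py, not an actual function in the library):

```python
def build_train_params(n_gpu, distributed_backend="ddp"):
    """Hypothetical helper: collect pytorch_lightning Trainer kwargs,
    exposing the backend instead of hardcoding "ddp"."""
    train_params = {}
    if n_gpu > 1:
        # Caller can pick "ddp", "ddp_spawn", or "dp" to troubleshoot.
        train_params["distributed_backend"] = distributed_backend
    return train_params
```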

Other forms of parallelism are explained in the pytorch-lightning docs as follows:

Distributed modes: Lightning allows multiple ways of training

  • Data Parallel (accelerator='dp') (multiple-gpus, 1 machine)

  • DistributedDataParallel (accelerator='ddp') (multiple-gpus across many machines (python script based)).

  • DistributedDataParallel (accelerator='ddp_spawn') (multiple-gpus across many machines (spawn based)).

  • DistributedDataParallel 2 (accelerator='ddp2') (DP in a machine, DDP across machines).

  • Horovod (accelerator='horovod') (multi-machine, multi-gpu, configured at runtime)

  • TPUs (tpu_cores=8|x) (tpu or TPU pod)

What do you think?

cderinbogaz avatar Jun 02 '21 11:06 cderinbogaz


We basically can't use aitextgen to train larger models until this is fixed, because they don't fit onto a single GPU.

hugbubby avatar Jun 06 '21 20:06 hugbubby

Has this not been addressed yet / is there no workaround?

Alx-AI avatar Jun 11 '21 02:06 Alx-AI

Hello there. I had the same issue. I was trying to train the 1558M GPT-2 model but always got that error when loading. Further up, the error message says that the ddp_spawn strategy was chosen by default. I worked around that by changing the "fallback strategy" to ddp. All you need to do is edit the pytorch_lightning source (~/.local/lib/python3.x/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py) at line 853 (self.distributed_backend = DistributedType.DDP_SPAWN) and pick a distribution type that works better. I am not sure whether that is a "good" solution, but at least it is some sort of workaround. Hope that helps.

decurus avatar Jan 16 '22 00:01 decurus

Kaggle never has this issue.

breadbrowser avatar Jul 15 '22 02:07 breadbrowser