Grounding_LLMs_with_online_RL

How to run train_language_agent.py without using Slurm

Open · yanxue7 opened this issue on Sep 14, 2023 · 2 comments

Hi,

Because I don't know how to use Slurm, I tried to run train_language_agent.py directly with the lamorel command:

python -m lamorel_launcher.launch --config-path /home/yanxue/Grounding/experiments/configs --config-name local_gpu_config rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py

and my config is:


lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 3
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
  updater_args:
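
For context, my understanding is that train_language_agent.py follows the usual lamorel/hydra structure, with the launcher spawning one RL process and one LLM process that talk to each other. A heavily simplified sketch of that structure (based on lamorel's README; the real script does much more, and the prompt and config names below are only placeholders):

import hydra
from lamorel import Caller, lamorel_init

lamorel_init()  # initialize lamorel's distributed setup for this process

@hydra.main(config_path="configs", config_name="local_gpu_config")
def main(config_args):
    # The RL process creates a Caller that forwards requests to the LLM process(es)
    lm_server = Caller(config_args.lamorel_args)
    # e.g. ask the LLM server to score candidate actions given a textual observation
    scores = lm_server.score(
        contexts=["You see a red ball. Goal: go to the red ball. Next action:"],
        candidates=[["turn left", "turn right", "go forward"]],
    )
    print(scores)
    lm_server.close()  # shut down the LLM server before exiting

if __name__ == "__main__":
    main()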

But I get the following error:

[2023-09-14 20:45:32,837][torch.distributed.elastic.multiprocessing.api][ERROR] - failed (exitcode: 1) local_rank: 0 (pid: 3946796) of binary: /home/yanxue/anaconda3/envs/dlp/bin/python
Error executing job with overrides: ['rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py']
Traceback (most recent call last):
  File "/home/yanxue/Grounding/lamorel/lamorel/src/lamorel_launcher/launch.py", line 46, in main
    launch_command(accelerate_args)
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/yanxue/Grounding/experiments/train_language_agent.py FAILED

Failures:
[1]:
  time      : 2023-09-14_20:45:32
  host      : taizun-R282-Z96-00
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3946797)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-09-14_20:45:32
  host      : taizun-R282-Z96-00
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3946796)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you kindly suggest why this error happens?

yanxue7 · Sep 14 '23 13:09

Also, I ran the example/ppo_finetuning script for the BabyAI-MixedTrainLocal environment in lamorel with this modified config:

rl_script_args:
  path: ???
  name_environment: 'BabyAI-MixedTrainLocal'

  #'BabyAI-GoToRedBall-v0'
  #'BabyAI-MixedTrainLocal'
  #'BabyAI-GoToRedBall-v0'
  #'BabyAI-MixedTestLocal'
  #'BabyAI-GoToRedBall-v0'
  epochs: 1000
  steps_per_epoch: 1500
  minibatch_size: 64
  gradient_batch_size: 16
  ppo_epochs: 4
  lam: 0.99
  gamma: 0.99
  target_kl: 0.01
  max_ep_len: 1000
  lr: 1e-4
  entropy_coef: 0.01
  value_loss_coef: 0.5
  clip_eps: 0.2
  max_grad_norm: 0.5
  save_freq: 100
  output_dir: "/home/yanxue/lamoral/pposmalltrain"

But it seems to fail to train: it only gets a score of around 0.2, less than the 0.6 reported in your paper.

yanxue7 · Sep 14 '23 13:09

Hi,

Concerning your first issue, the stack trace you provided misses the real error that happened, so I can't tell. Anyway, accelerate has some difficulties launching two processes on a single machine with only 1 GPU, which is why we provided a custom version of accelerate (which is now outdated). Could you try these two PRs (1, 2) please? Or manually launch the two processes as shown in Lamorel's documentation.
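
If you go the manual route, I don't have the exact commands at hand, so please take the following only as a rough sketch and check the documentation for the real arguments: the general idea is to start the two processes yourself, each with its own machine_rank, and let them rendezvous through accelerate (the flags below are standard accelerate launch options; the IP and port values are placeholders), e.g. in two terminals:

accelerate launch --config_file accelerate/default_config.yaml --num_machines 2 --machine_rank 0 --main_process_ip 127.0.0.1 --main_process_port 29500 /home/yanxue/Grounding/experiments/train_language_agent.py --config-path /home/yanxue/Grounding/experiments/configs --config-name local_gpu_config

accelerate launch --config_file accelerate/default_config.yaml --num_machines 2 --machine_rank 1 --main_process_ip 127.0.0.1 --main_process_port 29500 /home/yanxue/Grounding/experiments/train_language_agent.py --config-path /home/yanxue/Grounding/experiments/configs --config-name local_gpu_config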

Concerning your second issue, this is weird. Let me try to launch some experiments and find out what happens.

ClementRomac · Sep 27 '23 07:09