Grounding_LLMs_with_online_RL
How to run train_language_agent.py without using Slurm
Hi,
Because I don't know how to use Slurm, I tried to run train_language_agent.py directly with the command from lamorel:
```bash
python -m lamorel_launcher.launch --config-path /home/yanxue/Grounding/experiments/configs --config-name local_gpu_config rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py
```
and my config is:

```yaml
lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 3
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
  updater_args:
```
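For reference, the `accelerate/default_config.yaml` it points to looks roughly like this (only a sketch for 2 processes on a single machine; the exact fields are my assumption and depend on the accelerate version):

```yaml
# Sketch of an accelerate config for 2 processes on one machine.
# Field values are an assumption and depend on the accelerate version.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
```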
But I get the following error:
```
[2023-09-14 20:45:32,837][torch.distributed.elastic.multiprocessing.api][ERROR] - failed (exitcode: 1) local_rank: 0 (pid: 3946796) of binary: /home/yanxue/anaconda3/envs/dlp/bin/python
Error executing job with overrides: ['rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py']
Traceback (most recent call last):
  File "/home/yanxue/Grounding/lamorel/lamorel/src/lamorel_launcher/launch.py", line 46, in main
    launch_command(accelerate_args)
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/yanxue/Grounding/experiments/train_language_agent.py FAILED

Failures:
[1]:
  time      : 2023-09-14_20:45:32
  host      : taizun-R282-Z96-00
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3946797)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-09-14_20:45:32
  host      : taizun-R282-Z96-00
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3946796)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
Could you kindly suggest why this error happens?
I also ran the example/ppo_finetuning script for the BabyAI-MixedTrainLocal environment in lamorel with this modified config:
```yaml
rl_script_args:
  path: ???
  name_environment: 'BabyAI-MixedTrainLocal'
  #'BabyAI-GoToRedBall-v0'
  #'BabyAI-MixedTrainLocal'
  #'BabyAI-GoToRedBall-v0'
  #'BabyAI-MixedTestLocal'
  #'BabyAI-GoToRedBall-v0'
  epochs: 1000
  steps_per_epoch: 1500
  minibatch_size: 64
  gradient_batch_size: 16
  ppo_epochs: 4
  lam: 0.99
  gamma: 0.99
  target_kl: 0.01
  max_ep_len: 1000
  lr: 1e-4
  entropy_coef: 0.01
  value_loss_coef: 0.5
  clip_eps: 0.2
  max_grad_norm: 0.5
  save_freq: 100
  output_dir: "/home/yanxue/lamoral/pposmalltrain"
```
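For context, this is how I understand these hyperparameters enter a standard PPO update with GAE. It is only a generic sketch to show what I expect them to control; the function names are mine, not the repository's actual code:

```python
# Generic PPO sketch (not the repository's implementation) showing where
# lam, gamma, clip_eps, value_loss_coef and entropy_coef are used.
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.99):
    """Generalized Advantage Estimation over one rollout (values has len(rewards) + 1 entries)."""
    advantages = torch.zeros(len(rewards))
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        last_gae = delta + gamma * lam * not_done * last_gae
        advantages[t] = last_gae
    return advantages

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, value_loss_coef=0.5, entropy_coef=0.01):
    """Clipped surrogate objective plus value and entropy terms."""
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = ((values - returns) ** 2).mean()
    # target_kl is typically used to stop the ppo_epochs loop early when the
    # approximate KL (old_logp - new_logp).mean() exceeds it; max_grad_norm
    # clips gradients before the optimizer step (lr = 1e-4 above).
    return policy_loss + value_loss_coef * value_loss - entropy_coef * entropy.mean()
```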
But it seems to fail to train: it only reaches a score of around 0.2, which is less than the 0.6 reported in your paper.
Hi,
Concerning your first issue, the stack trace you provided does not show the actual error that occurred, so I can't tell what went wrong. In any case, accelerate has some difficulties launching two processes on a single machine with only 1 GPU, which is why we provided a custom version of accelerate (now outdated). Could you please try these two PRs (1, 2)? Or launch the two processes manually as shown in Lamorel's documentation, for example along the lines sketched below.
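Roughly, the manual launch amounts to starting one launcher per process in two separate terminals and letting them rendezvous with each other. The overrides below are only a sketch (they reuse the keys from your config and may need extra rendezvous settings such as `main_process_ip`/`main_process_port`), so please double-check them against the documentation:

```bash
# Sketch only: start the RL process and the LLM process by hand in two terminals,
# treating them as two "machines" so accelerate does not spawn both processes itself.
# Exact overrides may differ; see Lamorel's documentation.

# Terminal 1 (RL process)
python -m lamorel_launcher.launch \
    --config-path /home/yanxue/Grounding/experiments/configs \
    --config-name local_gpu_config \
    rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py \
    lamorel_args.accelerate_args.num_machines=2 \
    lamorel_args.accelerate_args.machine_rank=0

# Terminal 2 (LLM process)
python -m lamorel_launcher.launch \
    --config-path /home/yanxue/Grounding/experiments/configs \
    --config-name local_gpu_config \
    rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py \
    lamorel_args.accelerate_args.num_machines=2 \
    lamorel_args.accelerate_args.machine_rank=1
```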
Concerning your second issue, this is weird. Let me try to launch some experiments and find out what happens.