OpenRLHF
HTTPError when running train_ppo_llama_ray.sh
What happened + What you expected to happen:
Steps to reproduce:
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
The head node started successfully:
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 0.0.0.0
Ray runtime started.
Next steps
To add another node to this Ray cluster, run
  ray start --address='0.0.0.0:6379'
To connect to this Ray cluster:
  import ray
  ray.init(_node_ip_address='0.0.0.0')
To submit a Ray job using the Ray Jobs CLI:
  RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html for more information on submitting Ray jobs to the Ray cluster.
To terminate the Ray runtime, run ray stop
To view the status of the cluster, use ray status
To monitor and debug Ray, view the dashboard at 127.0.0.1:8265
If connection to the dashboard fails, check your firewall settings and network configuration.
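Before submitting a job, it can help to confirm that the dashboard's HTTP API is actually answering. The sketch below polls the same `/api/version` URL that appears in the traceback further down and reports the status code of each attempt. It is only a diagnostic aid under assumptions: the function names, retry count, and delay are hypothetical choices, and it uses the Python standard library rather than any Ray API.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no connection could be made."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Nothing is listening, or the connection was refused or timed out.
        return None


def wait_for_dashboard(url, attempts=10, delay=2.0, fetch=fetch_status, sleep=time.sleep):
    """Poll url until it returns 200; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        status = fetch(url)
        print(f"attempt {attempt}: status={status}")
        if status == 200:
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

Calling `wait_for_dashboard("http://127.0.0.1:8265/api/version")` on the head node prints one line per attempt. A run of `503` results would suggest something is answering on that port but is not ready (or is not the dashboard), while `None` would suggest nothing is reachable at that address.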
My Configuration
set -x
export PATH=$HOME/.local/bin/:$PATH
ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
    -- python3 examples/train_ppo_ray.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 1 \
    --reward_num_nodes 1 \
    --reward_num_gpus_per_node 1 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 2 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 4 \
    --pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
    --reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
    --save_path /openrlhf/examples/test_scripts/ckpt/7b_llama \
    --micro_train_batch_size 8 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 16 \
    --rollout_batch_size 1024 \
    --max_epochs 1 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 2 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data Open-Orca/OpenOrca \
    --prompt_data_probs 1 \
    --max_samples 80000 \
    --normalize_reward \
    --actor_init_on_gpu \
    --adam_offload \
    --flash_attn \
    --gradient_checkpointing \
    --use_wandb {wandb_token}
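As a sanity check on the configuration above, the per-role GPU requests can be added up and compared with the 8 GPUs given to `ray start`. This is a rough sketch that assumes each role (ref, reward, critic, actor) gets its own dedicated GPUs; the dictionary and helper name are hypothetical, and the numbers are copied from the command.

```python
# (nodes, GPUs per node) for each role, copied from the submit command.
requested = {
    "ref": (1, 1),
    "reward": (1, 1),
    "critic": (1, 2),
    "actor": (1, 4),
}
available_gpus = 8  # from `ray start --head ... --num-gpus 8`


def total_gpus(roles):
    """Sum nodes * GPUs-per-node over all roles."""
    return sum(nodes * per_node for nodes, per_node in roles.values())


total = total_gpus(requested)
print(f"requested {total} of {available_gpus} GPUs")
# prints: requested 8 of 8 GPUs
```

Under that assumption the request fits the cluster exactly, so the GPU allocation itself is unlikely to explain the error. The error is also raised while the CLI is still connecting to the dashboard, before any job is scheduled.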
Error Information
Traceback (most recent call last):
  File "/root/miniconda3/envs/lzy/bin/ray", line 33, in <module>
    sys.exit(load_entry_point('ray==2.12.0', 'console_scripts', 'ray')())
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2612, in main
    return cli()
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 264, in submit
    client = _get_sdk_client(
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 29, in _get_sdk_client
    client = JobSubmissionClient(
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/sdk.py", line 109, in __init__
    self._check_connection_and_version(
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 248, in _check_connection_and_version
    self._check_connection_and_version_with_url(min_version, version_error_message)
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 267, in _check_connection_and_version_with_url
    r.raise_for_status()
  File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://127.0.0.1:8265/api/version