ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
This is my script:
data_path=/home/work/dataset/data/Countdown-Tasks-3to4
PYTHONUNBUFFERED=1
HYDRA_FULL_ERROR=1
CUDA_LAUNCH_BLOCKING=1
WORKING_DIR=${WORKING_DIR:-"${PWD}"}
RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/recipe/gkd/runtime_env.yaml"}

ray job submit --no-wait \
    --working-dir "${WORKING_DIR}" \
    --runtime-env "${RUNTIME_ENV}" \
    -- python3 -m recipe.gkd.main_gkd \
    data.train_files=${data_path}/train.parquet \
    data.val_files=${data_path}/test.parquet \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=32 \
    data.max_prompt_length=1024 \
    data.max_response_length=2048 \
    actor_rollout_ref.model.path=/home/work/dpskv3-19b \
    actor_rollout_ref.actor.strategy=megatron \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=1 \
    actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.actor.megatron.dist_checkpointing_path=/home/work/dpskv3-19b-mcore \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=['console'] \
    trainer.val_before_train=False \
    trainer.default_hdfs_dir=null \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=1000 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log
And this is my log:
Error executing job with overrides: ['data.train_files=/home/work/dataset/data/Countdown-Tasks-3to4/train.parquet', 'data.val_files=/home/work/dataset/data/Countdown-Tasks-3to4/test.parquet', 'algorithm.adv_estimator=grpo', 'data.train_batch_size=32', 'data.max_prompt_length=1024', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=/home/work/dpskv3-19b', 'actor_rollout_ref.actor.strategy=megatron', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.use_dynamic_bsz=True', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=1', 'actor_rollout_ref.actor.megatron.use_dist_checkpointing=True', 'actor_rollout_ref.actor.megatron.dist_checkpointing_path=/home/work/dpskv3-19b-mcore', 'actor_rollout_ref.actor.ppo_mini_batch_size=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.enforce_eager=False', 'actor_rollout_ref.rollout.free_cache_engine=False', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[console]', 'trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=1000', 'trainer.test_freq=10', 'trainer.total_epochs=15']
Traceback (most recent call last):
File "/tmp/ray/session_2025-06-30_16-31-18_671673_76553/runtime_resources/working_dir_files/_ray_pkg_d8c87cc5822ec27f/recipe/gkd/main_gkd.py", line 64, in main
run_ppo(config)
File "/tmp/ray/session_2025-06-30_16-31-18_671673_76553/runtime_resources/working_dir_files/_ray_pkg_d8c87cc5822ec27f/recipe/gkd/main_gkd.py", line 76, in run_ppo
ray.get(runner.run.remote(config))
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2849, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 937, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RaySystemError): ray::TaskRunner.run() (pid=94233, ip=10.215.192.14, actor_id=59bf205668df8e5da8d6a5d203000000, repr=<main_gkd.TaskRunner object at 0x7f9f0b948850>)
File "/tmp/ray/session_2025-06-30_16-31-18_671673_76553/runtime_resources/working_dir_files/_ray_pkg_d8c87cc5822ec27f/recipe/gkd/main_gkd.py", line 182, in run
I'm also running into this problem. How can I solve it?
Same problem
same error
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/cfs_cloud_code/daoyichen/data/BytedTsinghua-SIA/DAPO-Math-17k/data/dapo-math-17k.parquet', 'data.val_files=/cfs_cloud_code/daoyichen/data/ReTool-AIME-2024/data/aime-2024.parquet', 'data.prompt_key=prompt', 'data.truncation=left', 'data.return_raw_chat=True', 'data.max_prompt_length=2048', 'data.max_response_length=4096', 'data.train_batch_size=64', 'actor_rollout_ref.rollout.n=4', 'algorithm.adv_estimator=grpo', 'algorithm.use_kl_in_reward=False', 'algorithm.kl_ctrl.kl_coef=0.0', 'actor_rollout_ref.actor.use_kl_loss=False', 'actor_rollout_ref.actor.kl_loss_coef=0.0', 'actor_rollout_ref.actor.clip_ratio_low=0.2', 'actor_rollout_ref.actor.clip_ratio_high=0.28', 'actor_rollout_ref.actor.clip_ratio_c=10.0', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.model.path=/apdcephfs_zwfy2/share_304053830/hunyuan/daoyichen//Qwen3-235B-A22B-Thinking-2507', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.optim.lr_warmup_steps=10', 'actor_rollout_ref.actor.optim.weight_decay=0.1', 'actor_rollout_ref.actor.ppo_mini_batch_size=16', 'actor_rollout_ref.actor.megatron.param_offload=False', 'actor_rollout_ref.actor.megatron.optimizer_offload=False', 'actor_rollout_ref.actor.megatron.grad_offload=False', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=8', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.megatron.context_parallel_size=8', 'actor_rollout_ref.actor.megatron.expert_model_parallel_size=8', 'actor_rollout_ref.actor.megatron.dist_checkpointing_path=/apdcephfs_zwfy2/share_304053830/hunyuan/daoyichen//Qwen3-235B-A22B-Thinking-2507-mcore-256', 'actor_rollout_ref.actor.megatron.use_dist_checkpointing=True', '+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=5', '+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=5', 'actor_rollout_ref.actor.entropy_coeff=0', 'actor_rollout_ref.actor.optim.clip_grad=1.0', 'actor_rollout_ref.actor.loss_agg_mode=token-mean', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.65', 'actor_rollout_ref.rollout.tensor_model_parallel_size=16', 'actor_rollout_ref.rollout.enable_chunked_prefill=True', 'actor_rollout_ref.rollout.max_num_batched_tokens=6144', 'actor_rollout_ref.rollout.temperature=1.0', 'actor_rollout_ref.rollout.top_p=1.0', 'actor_rollout_ref.rollout.top_k=-1', 'actor_rollout_ref.rollout.val_kwargs.temperature=1.0', 'actor_rollout_ref.rollout.val_kwargs.top_p=0.7', 'actor_rollout_ref.rollout.val_kwargs.top_k=-1', 'actor_rollout_ref.rollout.val_kwargs.do_sample=True', 'actor_rollout_ref.rollout.val_kwargs.n=1', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.mode=async', 'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=8', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.ref.megatron.expert_model_parallel_size=8', 'actor_rollout_ref.ref.megatron.param_offload=False', 'actor_rollout_ref.ref.megatron.dist_checkpointing_path=/apdcephfs_zwfy2/share_304053830/hunyuan/daoyichen//Qwen3-235B-A22B-Thinking-2507-mcore-256', 'actor_rollout_ref.ref.megatron.use_dist_checkpointing=True', 'reward_model.reward_manager=dapo', '+reward_model.reward_kwargs.overlong_buffer_cfg.enable=True', 
'+reward_model.reward_kwargs.overlong_buffer_cfg.len=4096', '+reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=0.1', '+reward_model.reward_kwargs.overlong_buffer_cfg.log=False', '+reward_model.reward_kwargs.max_resp_len=4096', 'trainer.logger=["console","tensorboard"]', 'trainer.project_name=qwen235b', 'trainer.experiment_name=DAPO-Qwen3-236b-megatron-512gpus', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=32', 'trainer.val_before_train=False', 'trainer.test_freq=10', 'trainer.save_freq=20', 'trainer.total_epochs=10', 'trainer.total_training_steps=100', 'trainer.default_local_dir=/apdcephfs_zwfy2/share_304053830/hunyuan/daoyichen//outputs/qwen235b/DAPO-Qwen3-236b-megatron-512gpus', 'trainer.resume_mode=auto', 'trainer.log_val_generations=10']
Traceback (most recent call last):
File "/cfs_cloud_code/daoyichen/SearchAgent-RL/verl/trainer/main_ppo.py", line 40, in main
run_ppo(config)
File "/cfs_cloud_code/daoyichen/SearchAgent-RL/verl/trainer/main_ppo.py", line 77, in run_ppo
ray.get(runner.run.remote(config))
File "/opt/conda/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 2822, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 930, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RaySystemError): ray::TaskRunner.run() (pid=203428, ip=28.48.1.228, actor_id=16fc140d06ce5d48c19e5bcd02000000, repr=<main_ppo.TaskRunner object at 0x7f36659450a0>)
File "/cfs_cloud_code/daoyichen/SearchAgent-RL/verl/trainer/main_ppo.py", line 244, in run
trainer.fit()
File "/cfs_cloud_code/daoyichen/SearchAgent-RL/verl/trainer/ppo/ray_trainer.py", line 1213, in fit
old_log_prob = self.actor_rollout_wg.compute_log_prob(batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/cfs_cloud_code/daoyichen/SearchAgent-RL/verl/single_controller/ray/base.py", line 50, in __call__
output = ray.get(output)
^^^^^^^^^^^^^^^
ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
File "/opt/conda/lib/python3.12/site-packages/ray/exceptions.py", line 51, in from_ray_exception
return pickle.loads(ray_exception.serialized_exception)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: BackendCompilerFailed.__init__() missing 1 required positional argument: 'inner_exception'
The above exception was the direct cause of the following exception:
ray::TaskRunner.run() (pid=203428, ip=28.48.1.228, actor_id=16fc140d06ce5d48c19e5bcd02000000, repr=<main_ppo.TaskRunner object at 0x7f36659450a0>)
File "/opt/conda/lib/python3.12/site-packages/ray/exceptions.py", line 45, in from_bytes
return RayError.from_ray_exception(ray_exception)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/ray/exceptions.py", line 54, in from_ray_exception
raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
---------------------------------------
Job 'raysubmit_VnhST4eF26j3VACz' failed
---------------------------------------
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/ray/exceptions.py", line 45, in from_bytes
return RayError.from_ray_exception(ray_exception)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/ray/exceptions.py", line 54, in from_ray_exception
raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
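For anyone else hitting this: the RaySystemError on the driver is a secondary failure. The worker raised a real exception (judging from the embedded TypeError, a torch.compile BackendCompilerFailed), Ray serialized it on the worker, and pickle.loads on the driver could not rebuild it, most likely because that exception class requires a second constructor argument that is never stored in exc.args. Below is a minimal, self-contained sketch of that mechanism, using a hypothetical stand-in class rather than the real torch/Ray code:

    import pickle

    # Hypothetical stand-in for an exception shaped like torch._dynamo.exc.BackendCompilerFailed:
    # __init__ needs a second positional argument that never ends up in self.args.
    class StandInCompilerFailed(Exception):
        def __init__(self, backend_name, inner_exception):
            # Only the formatted message goes into self.args ...
            super().__init__(f"backend={backend_name!r} raised: {inner_exception!r}")
            # ... the second argument only lives as an instance attribute.
            self.inner_exception = inner_exception

    # Pickling succeeds: the default reduce for exceptions records (type, self.args).
    blob = pickle.dumps(StandInCompilerFailed("inductor", RuntimeError("compile failed")))

    # Unpickling fails: pickle re-invokes __init__ with only self.args (one string),
    # which matches the TypeError Ray reports while deserializing the worker's exception.
    try:
        pickle.loads(blob)
    except TypeError as e:
        print(e)  # ... missing 1 required positional argument: 'inner_exception'

So the unpickling error only hides whatever made torch.compile fail in the first place; the call path above points at compute_log_prob.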
same error
My ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception is followed by TypeError: BackendCompilerFailed.__init__() missing 1 required positional argument: 'inner_exception' 👎
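Because the deserialization failure masks the underlying error, the actual compile failure should still be sitting in the worker-side logs on the node that ran the failing actor (the ip/pid are printed in the traceback above). Here is a rough sketch for digging it out, assuming Ray's default log layout under /tmp/ray/session_latest/logs (directory and file naming can differ across Ray versions or with a custom --temp-dir):

    from pathlib import Path

    # Scan the Ray worker stderr files on the failing node for the original
    # torch.compile error that never made it back to the driver.
    # Assumption: default Ray temp dir; adjust the path if Ray was started
    # with a custom --temp-dir.
    log_dir = Path("/tmp/ray/session_latest/logs")
    for log_file in sorted(log_dir.glob("worker-*.err")):
        text = log_file.read_text(errors="ignore")
        if "BackendCompilerFailed" in text:
            print(f"--- {log_file} ---")
            # The tail of the file usually holds the full compiler traceback.
            print("\n".join(text.splitlines()[-60:]))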
same error