
Tried to run main_generation.py, but it raised ConfigAttributeError: Key 'actor' is not in struct

Di-viner opened this issue 10 months ago · 5 comments

Thanks for the great work! I tried to run main_generation.py but it raised a ConfigAttributeError: Key 'actor' is not in struct.

My command is similar to examples/generation/run_deepseek_v2_lite_math.sh:

python3 -m verl.trainer.main_generation \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=2 \
    data.path=... \
    data.prompt_key=prompt \
    data.n_samples=1 \
    data.output_path=... \
    model.path=... \
    +model.trust_remote_code=True \
    rollout.temperature=1.0 \
    rollout.top_k=50 \
    rollout.top_p=0.7 \
    rollout.prompt_length=2048 \
    rollout.response_length=2048 \
    rollout.tensor_model_parallel_size=2 \
    rollout.gpu_memory_utilization=0.8

but it failed with the following error:

Traceback (most recent call last):
  File "/verl/verl/trainer/main_generation.py", line 68, in main
    wg.init_model()
  File "/verl/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
  File "/verl/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/verl/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/verl/lib/python3.9/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/verl/lib/python3.9/site-packages/ray/_private/worker.py", line 908, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::5ZAuQSActorRolloutRefWorker_0:1:ActorRolloutRefWorker.__init__() (pid=328776, ip=10.17.176.67, actor_id=08452910c3c019b8e00ed93801000000, repr=<verl.workers.fsdp_workers.ActorRolloutRefWorker object at 0x7ff00c3f2850>)
  File "/verl/verl/workers/fsdp_workers.py", line 88, in __init__
    self.device_mesh = create_device_mesh(world_size=world_size, fsdp_size=self.config.actor.fsdp_config.fsdp_size)
  File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 359, in __getattr__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/verl/lib/python3.9/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 819, in format_and_raise
    _raise(ex, cause)
  File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "/verl/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 475, in _get_node
    self._validate_get(key)
  File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 164, in _validate_get
    self._format_and_raise(
  File "/verl/lib/python3.9/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
omegaconf.errors.ConfigAttributeError: Key 'actor' is not in struct
    full_key: actor
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(ActorRolloutRefWorker pid=328776) [rank1]:[W221 10:08:21.136445407 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
(ActorRolloutRefWorker pid=328776) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::5ZAuQSActorRolloutRefWorker_0:1:ActorRolloutRefWorker.__init__() (pid=328776, ip=10.17.176.67, actor_id=08452910c3c019b8e00ed93801000000, repr=<verl.workers.fsdp_workers.ActorRolloutRefWorker object at 0x7ff00c3f2850>)
(ActorRolloutRefWorker pid=328776)   File "/verl/verl/workers/fsdp_workers.py", line 88, in __init__
(ActorRolloutRefWorker pid=328776)     self.device_mesh = create_device_mesh(world_size=world_size, fsdp_size=self.config.actor.fsdp_config.fsdp_size)
(ActorRolloutRefWorker pid=328776)   File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__ [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ActorRolloutRefWorker pid=328776)     self._format_and_raise(key=key, value=None, cause=e)
(ActorRolloutRefWorker pid=328776)   File "/verl/lib/python3.9/site-packages/omegaconf/base.py", line 231, in _format_and_raise [repeated 2x across cluster]
(ActorRolloutRefWorker pid=328776)     format_and_raise( [repeated 2x across cluster]
(ActorRolloutRefWorker pid=328776)   File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 899, in format_and_raise [repeated 2x across cluster]
(ActorRolloutRefWorker pid=328776)     _raise(ex, cause) [repeated 2x across cluster]
(ActorRolloutRefWorker pid=328776)   File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 797, in _raise [repeated 2x across cluster]
(ActorRolloutRefWorker pid=328776)     raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace [repeated 2x across cluster]
(ActorRolloutRefWorker pid=328776)     return self._get_impl(
(ActorRolloutRefWorker pid=328776)   File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
(ActorRolloutRefWorker pid=328776)     node = self._get_child(
(ActorRolloutRefWorker pid=328776)   File "/verl/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
(ActorRolloutRefWorker pid=328776)     child = self._get_node(
(ActorRolloutRefWorker pid=328776)   File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 475, in _get_node
(ActorRolloutRefWorker pid=328776)     self._validate_get(key)
(ActorRolloutRefWorker pid=328776)   File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 164, in _validate_get
(ActorRolloutRefWorker pid=328776)     self._format_and_raise(
(ActorRolloutRefWorker pid=328776) omegaconf.errors.ConfigAttributeError: Key 'actor' is not in struct
(ActorRolloutRefWorker pid=328776)     full_key: actor
(ActorRolloutRefWorker pid=328776)     object_type=dict

Am I missing something? I don't know if there is something wrong with the config.

Di-viner avatar Feb 21 '25 02:02 Di-viner

If you add some dummy args to the end of your bash script:

    +actor.fsdp_config.wrap_policy.min_num_params=0
    +actor.fsdp_config.param_offload=False
    +actor.fsdp_config.grad_offload=False
    +actor.fsdp_config.optimizer_offload=False
    +actor.fsdp_config.fsdp_size=-1

This gets past the error above, but then hits a new one in _build_model_optimizer (line 266): lr=optim_config.lr raises AttributeError: 'NoneType' object has no attribute 'lr'.
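
For context, this new error suggests _build_model_optimizer reads optim_config.lr even when no optimizer config is supplied (generation-only runs never train). Below is a minimal sketch of the kind of guard that would avoid it, with a deliberately simplified signature; it is not verl's actual function, which also builds the FSDP-wrapped module:

from typing import Optional

import torch
from torch import nn


def build_optimizer(module: nn.Module, optim_config: Optional[dict]):
    # Generation-only runs (main_generation) never take an optimizer step,
    # so no optimizer config is supplied and no optimizer should be built.
    if optim_config is None:
        return None
    return torch.optim.AdamW(
        module.parameters(),
        lr=optim_config["lr"],
        weight_decay=optim_config.get("weight_decay", 0.0),
    )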

Andrewzh112 avatar Feb 23 '25 04:02 Andrewzh112

I think adding the relevant FSDP parameter section to verl/trainer/config/generation.yaml (copied from ppo_trainer.yaml) and making a few code modifications will let it run properly. The bug most likely comes from the actor worker being constructed from a config template whose actor section is never filled in for generation-only runs.
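
If you would rather patch the script than copy the YAML section, the same idea can be sketched in main_generation.py by filling a default actor section into the config before the workers are constructed. This is a minimal sketch assuming the Hydra config is an OmegaConf DictConfig in struct mode; the default values below are illustrative assumptions, not verl's official ones:

from omegaconf import OmegaConf, open_dict


def fill_missing_actor_defaults(config):
    # Give the rollout worker the config.actor.* keys it reads in __init__,
    # mirroring what ppo_trainer.yaml normally provides. The defaults below
    # are illustrative placeholders, not verl's official values.
    with open_dict(config):  # temporarily lift struct mode so new keys can be added
        if "actor" not in config:
            config.actor = OmegaConf.create({
                "strategy": "fsdp",
                "fsdp_config": {
                    "fsdp_size": -1,
                    "param_offload": False,
                    "optimizer_offload": False,
                    "wrap_policy": {"min_num_params": 0},
                },
            })
    return config

Calling this on the config right after Hydra hands it to main() should get past the ConfigAttributeError in the same way as the +actor.* command-line overrides above.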


BearBiscuit05 avatar Feb 23 '25 07:02 BearBiscuit05

Thanks, @BearBiscuit05. After modifying verl/trainer/config/generation.yaml and verl/trainer/main_generation.py as described in #351, I ran the bash script again:

python3 -m verl.trainer.main_generation \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=4 \
    data.path=.. \
    data.prompt_key=prompt \
    data.n_samples=1 \
    data.output_path=.. \
    model.path=.. \
    +model.trust_remote_code=True \
    rollout.temperature=1.0 \
    rollout.top_k=-1 \
    rollout.top_p=1.0 \
    rollout.prompt_length=2048 \
    rollout.response_length=2048 \
    rollout.tensor_model_parallel_size=4 \
    rollout.gpu_memory_utilization=0.8

But got another error:

[1/11] Start to generate.
(ActorRolloutRefWorker pid=1288894) kwargs: {'n': 1, 'logprobs': 1, 'max_tokens': 2048, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1.0, 'ignore_eos': False}
(ActorRolloutRefWorker pid=1288894) After building vllm rollout, memory allocated (GB): 58.42914438247681, memory reserved (GB): 62.224609375
(ActorRolloutRefWorker pid=1288894) After building sharding manager, memory allocated (GB): 58.42914438247681, memory reserved (GB): 62.224609375
[2/11] Start to process.
[2/11] Start to generate.
[3/11] Start to process.
[3/11] Start to generate.
[4/11] Start to process.
[4/11] Start to generate.
[5/11] Start to process.
[5/11] Start to generate.
[6/11] Start to process.
[6/11] Start to generate.
[7/11] Start to process.
[7/11] Start to generate.
[8/11] Start to process.
[8/11] Start to generate.
[9/11] Start to process.
[9/11] Start to generate.
[10/11] Start to process.
[10/11] Start to generate.
[11/11] Start to process.
[11/11] Start to generate.
Error executing job with overrides: ['trainer.nnodes=1', 'trainer.n_gpus_per_node=4', 'data.path=..', 'data.prompt_key=prompt', 'data.n_samples=1', 'data.output_path=..', 'model.path=..', '+model.trust_remote_code=True', 'rollout.temperature=1.0', 'rollout.top_k=-1', 'rollout.top_p=1.0', 'rollout.prompt_length=2048', 'rollout.response_length=2048', 'rollout.tensor_model_parallel_size=4', 'rollout.gpu_memory_utilization=0.8']
Traceback (most recent call last):
  File "/verl/verl/trainer/main_generation.py", line 112, in main
    output = wg.generate_sequences(data)
  File "/verl/verl/single_controller/ray/base.py", line 39, in func
    args, kwargs = dispatch_fn(self, *args, **kwargs)
  File "/verl/verl/single_controller/base/decorator.py", line 275, in dispatch_dp_compute_data_proto
    splitted_args, splitted_kwargs = _split_args_kwargs_data_proto(worker_group.world_size, *args, **kwargs)
  File "/verl/verl/single_controller/base/decorator.py", line 50, in _split_args_kwargs_data_proto
    splitted_args.append(arg.chunk(chunks=chunks))
  File "/verl/verl/protocol.py", line 499, in chunk
    assert len(
AssertionError: only support equal chunk. Got size of DataProto 39 and chunk 4.

Does this mean that further data processing is required?

Di-viner avatar Feb 24 '25 03:02 Di-viner

Interesting. The params I used seemed to run fine; I'll try again with the params you set.

[9/11] Start to process.
[9/11] Start to generate.
[10/11] Start to process.
[10/11] Start to generate.
[11/11] Start to process.
dp_size 2 is not divisible by real_batch_size 39, add 1 dummy data
[11/11] Start to generate.
(ActorRolloutRefWorker pid=110496) kwargs: {'n': 1, 'logprobs': 1, 'max_tokens': 1024, 'detokenize': False, 'temperature': 1.0, 'top_k': 50, 'top_p': 0.7, 'ignore_eos': False} [repeated 3x across cluster]
(ActorRolloutRefWorker pid=110496) /root/miniconda3/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 3x across cluster]
(ActorRolloutRefWorker pid=110496)   warnings.warn( [repeated 3x across cluster]
root@ad7a4d25dfaf:/verl/tests/generation# 

BearBiscuit05 avatar Feb 24 '25 03:02 BearBiscuit05

(quoting Di-viner's previous comment, ending "AssertionError: only support equal chunk. Got size of DataProto 39 and chunk 4. ... Does this mean that further data processing is required?")

I hit the same error when num_gpus == tp; setting dp > 1 may avoid this problem.

BearBiscuit05 avatar Feb 24 '25 04:02 BearBiscuit05

Error executing job with overrides: ['trainer.nnodes=1', 'trainer.n_gpus_per_node=8', 'data.path=/mnt/disk2/wy/search-r1/multidoc_qa/train_data/test.parquet', 'data.prompt_key=prompt', 'data.n_samples=1', 'data.output_path=./tmp', 'model.path=/infinity/models/Qwen2.5-7B-Instruct-1M', '+model.trust_remote_code=True', 'rollout.temperature=1.0', 'rollout.top_k=50', 'rollout.top_p=0.7', 'rollout.prompt_length=2048', 'rollout.response_length=1024', 'rollout.tensor_model_parallel_size=4', 'rollout.gpu_memory_utilization=0.6']
Traceback (most recent call last):
  File "/mnt/disk2/wy/reft-exp/verl/verl/trainer/main_generation.py", line 107, in main
    output = wg.generate_sequences(data)
  File "/mnt/disk2/wy/reft-exp/verl/verl/single_controller/ray/base.py", line 39, in func
    args, kwargs = dispatch_fn(self, *args, **kwargs)
  File "/mnt/disk2/wy/reft-exp/verl/verl/single_controller/base/decorator.py", line 275, in dispatch_dp_compute_data_proto
    splitted_args, splitted_kwargs = _split_args_kwargs_data_proto(worker_group.world_size, *args, **kwargs)
  File "/mnt/disk2/wy/reft-exp/verl/verl/single_controller/base/decorator.py", line 50, in _split_args_kwargs_data_proto
    splitted_args.append(arg.chunk(chunks=chunks))
  File "/mnt/disk2/wy/reft-exp/verl/verl/protocol.py", line 499, in chunk
    assert len(
AssertionError: only support equal chunk. Got size of DataProto 4 and chunk 8.

I came across the same error.

command: sh run_gen_qwen2.5_7b_1M.sh 8 ./tmp

set -x

if [ "$#" -lt 2 ]; then
    echo "Usage: run_gen_qwen2.5_7b_1M.sh <nproc_per_node> <save_path> [other_configs...]"
    exit 1
fi

nproc_per_node=$1
save_path=$2

# Shift the arguments so $@ refers to the rest
shift 2

python3 -m verl.trainer.main_generation \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=$nproc_per_node \
    data.path=/mnt/disk2/train_data/test.parquet \
    data.prompt_key=prompt \
    data.n_samples=1 \
    data.output_path=$save_path \
    model.path=/infinity/models/Qwen2.5-7B-Instruct-1M \
    +model.trust_remote_code=True \
    rollout.temperature=1.0 \
    rollout.top_k=50 \
    rollout.top_p=0.7 \
    rollout.prompt_length=2048 \
    rollout.response_length=1024 \
    rollout.tensor_model_parallel_size=4 \
    rollout.gpu_memory_utilization=0.6

RoacherM avatar Mar 12 '25 12:03 RoacherM

The simplest way to fix this issue is to manually add dummy data in the code so the batch divides evenly across the workers.
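
Here is a minimal sketch of that dummy-data idea, applied to the prompt DataFrame before it is wrapped into a DataProto so no verl internals need to change; the helper name and the usage lines are illustrative, not existing verl code:

import pandas as pd


def pad_to_multiple(df: pd.DataFrame, world_size: int):
    # Pad the prompt set so its length divides evenly across the workers;
    # the duplicated rows are dummy data whose generations are dropped later.
    remainder = len(df) % world_size
    pad_size = (world_size - remainder) % world_size
    if pad_size:
        padding = pd.concat([df.iloc[[-1]]] * pad_size, ignore_index=True)
        df = pd.concat([df, padding], ignore_index=True)
    return df, pad_size


# Example: 39 prompts on 4 workers -> pad_size == 1.
# padded, pad_size = pad_to_multiple(dataset, world_size=4)
# ...generate on `padded`, then drop the last `pad_size` outputs...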

BearBiscuit05 avatar Mar 12 '25 12:03 BearBiscuit05

(quoting the previous comment: same command, same "AssertionError: only support equal chunk. Got size of DataProto 4 and chunk 8")

Fixed: the batch_size must be an integer multiple of n_gpus_per_node, so every batch can be split evenly across the workers.
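
A quick pre-flight check along those lines, assuming the input is the parquet file passed as data.path (the file name and GPU count below are placeholders):

import pandas as pd

n_gpus_per_node = 8                       # match trainer.n_gpus_per_node
df = pd.read_parquet("test.parquet")      # the file passed as data.path

extra = len(df) % n_gpus_per_node
if extra:
    # Either trim to a clean multiple of the GPU count, or pad with dummy prompts.
    df.iloc[: len(df) - extra].to_parquet("test_trimmed.parquet")
    print(f"dropped {extra} rows; point data.path at test_trimmed.parquet")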

RoacherM avatar Mar 12 '25 14:03 RoacherM