Tried to run main_generation.py, but it raised ConfigAttributeError: Key 'actor' is not in struct.
Thanks for the great work!
I tried to run main_generation.py with a command similar to examples/generation/run_deepseek_v2_lite_math.sh:
python3 -m verl.trainer.main_generation \
trainer.nnodes=1 \
trainer.n_gpus_per_node=2 \
data.path=... \
data.prompt_key=prompt \
data.n_samples=1 \
data.output_path=... \
model.path=... \
+model.trust_remote_code=True \
rollout.temperature=1.0 \
rollout.top_k=50 \
rollout.top_p=0.7 \
rollout.prompt_length=2048 \
rollout.response_length=2048 \
rollout.tensor_model_parallel_size=2 \
rollout.gpu_memory_utilization=0.8
but it failed with the following ConfigAttributeError:
Traceback (most recent call last):
File "/verl/verl/trainer/main_generation.py", line 68, in main
wg.init_model()
File "/verl/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
File "/verl/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/verl/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/verl/lib/python3.9/site-packages/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/verl/lib/python3.9/site-packages/ray/_private/worker.py", line 908, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::5ZAuQSActorRolloutRefWorker_0:1:ActorRolloutRefWorker.__init__() (pid=328776, ip=10.17.176.67, actor_id=08452910c3c019b8e00ed93801000000, repr=<verl.workers.fsdp_workers.ActorRolloutRefWorker object at 0x7ff00c3f2850>)
File "/verl/verl/workers/fsdp_workers.py", line 88, in __init__
self.device_mesh = create_device_mesh(world_size=world_size, fsdp_size=self.config.actor.fsdp_config.fsdp_size)
File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 359, in __getattr__
self._format_and_raise(key=key, value=None, cause=e)
File "/verl/lib/python3.9/site-packages/omegaconf/base.py", line 231, in _format_and_raise
format_and_raise(
File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 819, in format_and_raise
_raise(ex, cause)
File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 797, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
return self._get_impl(
File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
node = self._get_child(
File "/verl/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
child = self._get_node(
File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 475, in _get_node
self._validate_get(key)
File "/verl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 164, in _validate_get
self._format_and_raise(
File "/verl/lib/python3.9/site-packages/omegaconf/base.py", line 231, in _format_and_raise
format_and_raise(
File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
_raise(ex, cause)
File "/verl/lib/python3.9/site-packages/omegaconf/_utils.py", line 797, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
omegaconf.errors.ConfigAttributeError: Key 'actor' is not in struct
full_key: actor
object_type=dict
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Am I missing something, or is there something wrong with my config?
If you add some dummy args to the end of your bash script:
+actor.fsdp_config.wrap_policy.min_num_params=0
+actor.fsdp_config.param_offload=False
+actor.fsdp_config.grad_offload=False
+actor.fsdp_config.optimizer_offload=False
+actor.fsdp_config.fsdp_size=-1
This gets past the error above but raises a new one:

line 266, in _build_model_optimizer
    lr=optim_config.lr,
AttributeError: 'NoneType' object has no attribute 'lr'
I think adding the relevant FSDP parameter section to verl/trainer/config/generation.yaml (copied from ppo_trainer.yaml) and making some small code modifications will let it run properly. This bug likely stems from the actor-related fields never being filled into the generation config template, even though the worker still reads them.
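To illustrate why the lookup fails at all: Hydra hands the worker a struct-mode OmegaConf config built from generation.yaml, and struct mode turns any access to a missing section into exactly the ConfigAttributeError above. A minimal, self-contained sketch of that behavior (the keys and values here are illustrative, not verl's actual defaults):

```python
from omegaconf import OmegaConf, open_dict
from omegaconf.errors import ConfigAttributeError

# Roughly what Hydra passes to main_generation.py: a struct config with no 'actor' section.
cfg = OmegaConf.create({"model": {"path": "..."}, "rollout": {"temperature": 1.0}})
OmegaConf.set_struct(cfg, True)

try:
    _ = cfg.actor.fsdp_config.fsdp_size   # the lookup fsdp_workers.py performs in __init__
except ConfigAttributeError as err:
    print(err)                            # Key 'actor' is not in struct

# Giving the config a default actor section (e.g. copied from ppo_trainer.yaml,
# or injected programmatically) makes the same lookup succeed.
with open_dict(cfg):
    cfg.actor = {"fsdp_config": {"fsdp_size": -1, "param_offload": False}}
print(cfg.actor.fsdp_config.fsdp_size)    # -1
```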
Thanks, @BearBiscuit05. After modifying verl/trainer/config/generation.yaml and verl/trainer/main_generation.py as described in #351, I ran the bash script again:
python3 -m verl.trainer.main_generation \
trainer.nnodes=1 \
trainer.n_gpus_per_node=4 \
data.path=.. \
data.prompt_key=prompt \
data.n_samples=1 \
data.output_path=.. \
model.path=.. \
+model.trust_remote_code=True \
rollout.temperature=1.0 \
rollout.top_k=-1 \
rollout.top_p=1.0 \
rollout.prompt_length=2048 \
rollout.response_length=2048 \
rollout.tensor_model_parallel_size=4 \
rollout.gpu_memory_utilization=0.8
But got another error:
[1/11] Start to generate.
(ActorRolloutRefWorker pid=1288894) kwargs: {'n': 1, 'logprobs': 1, 'max_tokens': 2048, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1.0, 'ignore_eos': False}
(ActorRolloutRefWorker pid=1288894) After building vllm rollout, memory allocated (GB): 58.42914438247681, memory reserved (GB): 62.224609375
(ActorRolloutRefWorker pid=1288894) After building sharding manager, memory allocated (GB): 58.42914438247681, memory reserved (GB): 62.224609375
[2/11] Start to process.
[2/11] Start to generate.
[3/11] Start to process.
[3/11] Start to generate.
[4/11] Start to process.
[4/11] Start to generate.
[5/11] Start to process.
[5/11] Start to generate.
[6/11] Start to process.
[6/11] Start to generate.
[7/11] Start to process.
[7/11] Start to generate.
[8/11] Start to process.
[8/11] Start to generate.
[9/11] Start to process.
[9/11] Start to generate.
[10/11] Start to process.
[10/11] Start to generate.
[11/11] Start to process.
[11/11] Start to generate.
Error executing job with overrides: ['trainer.nnodes=1', 'trainer.n_gpus_per_node=4', 'data.path=..', 'data.prompt_key=prompt', 'data.n_samples=1', 'data.output_path=..', 'model.path=..', '+model.trust_remote_code=True', 'rollout.temperature=1.0', 'rollout.top_k=-1', 'rollout.top_p=1.0', 'rollout.prompt_length=2048', 'rollout.response_length=2048', 'rollout.tensor_model_parallel_size=4', 'rollout.gpu_memory_utilization=0.8']
Traceback (most recent call last):
File "/verl/verl/trainer/main_generation.py", line 112, in main
output = wg.generate_sequences(data)
File "/verl/verl/single_controller/ray/base.py", line 39, in func
args, kwargs = dispatch_fn(self, *args, **kwargs)
File "/verl/verl/single_controller/base/decorator.py", line 275, in dispatch_dp_compute_data_proto
splitted_args, splitted_kwargs = _split_args_kwargs_data_proto(worker_group.world_size, *args, **kwargs)
File "/verl/verl/single_controller/base/decorator.py", line 50, in _split_args_kwargs_data_proto
splitted_args.append(arg.chunk(chunks=chunks))
File "/verl/verl/protocol.py", line 499, in chunk
assert len(
AssertionError: only support equal chunk. Got size of DataProto 39 and chunk 4.
Does this mean that further data processing is required?
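It looks like the batch of 39 prompts cannot be split evenly across the 4 workers. One workaround I can think of, without touching verl, is to pad the input parquet to a multiple of the worker count before launching; a rough sketch, assuming pandas and illustrative file names:

```python
import pandas as pd

world_size = 4  # trainer.nnodes * trainer.n_gpus_per_node for this run
src = "test.parquet"          # illustrative path
dst = "test_padded.parquet"   # illustrative path

df = pd.read_parquet(src)
pad = (-len(df)) % world_size     # rows needed to reach the next multiple of world_size
if pad:
    # Repeat the last prompt; the extra generations can be dropped afterwards.
    df = pd.concat([df, df.iloc[[-1] * pad]], ignore_index=True)
df.to_parquet(dst)
print(f"{len(df) - pad} real prompts, {pad} padding rows -> {len(df)} total")
```

The duplicated prompts at the end of the output file can then be dropped after generation.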
Interesting. The params I used seem to run normally. I'll try again with the params you set.
[9/11] Start to process.
[9/11] Start to generate.
[10/11] Start to process.
[10/11] Start to generate.
[11/11] Start to process.
dp_size 2 is not divisible by real_batch_size 39, add 1 dummy data
[11/11] Start to generate.
(ActorRolloutRefWorker pid=110496) kwargs: {'n': 1, 'logprobs': 1, 'max_tokens': 1024, 'detokenize': False, 'temperature': 1.0, 'top_k': 50, 'top_p': 0.7, 'ignore_eos': False} [repeated 3x across cluster]
(ActorRolloutRefWorker pid=110496) /root/miniconda3/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 3x across cluster]
(ActorRolloutRefWorker pid=110496) warnings.warn( [repeated 3x across cluster]
root@ad7a4d25dfaf:/verl/tests/generation#
I met the same error when num_gpus == tensor_model_parallel_size (so dp = num_gpus / tp = 1); setting dp > 1 may avoid this problem.
Error executing job with overrides: ['trainer.nnodes=1', 'trainer.n_gpus_per_node=8', 'data.path=/mnt/disk2/wy/search-r1/multidoc_qa/train_data/test.parquet', 'data.prompt_key=prompt', 'data.n_samples=1', 'data.output_path=./tmp', 'model.path=/infinity/models/Qwen2.5-7B-Instruct-1M', '+model.trust_remote_code=True', 'rollout.temperature=1.0', 'rollout.top_k=50', 'rollout.top_p=0.7', 'rollout.prompt_length=2048', 'rollout.response_length=1024', 'rollout.tensor_model_parallel_size=4', 'rollout.gpu_memory_utilization=0.6']
Traceback (most recent call last):
  File "/mnt/disk2/wy/reft-exp/verl/verl/trainer/main_generation.py", line 107, in main
    output = wg.generate_sequences(data)
  File "/mnt/disk2/wy/reft-exp/verl/verl/single_controller/ray/base.py", line 39, in func
    args, kwargs = dispatch_fn(self, *args, **kwargs)
  File "/mnt/disk2/wy/reft-exp/verl/verl/single_controller/base/decorator.py", line 275, in dispatch_dp_compute_data_proto
    splitted_args, splitted_kwargs = _split_args_kwargs_data_proto(worker_group.world_size, *args, **kwargs)
  File "/mnt/disk2/wy/reft-exp/verl/verl/single_controller/base/decorator.py", line 50, in _split_args_kwargs_data_proto
    splitted_args.append(arg.chunk(chunks=chunks))
  File "/mnt/disk2/wy/reft-exp/verl/verl/protocol.py", line 499, in chunk
    assert len(
AssertionError: only support equal chunk. Got size of DataProto 4 and chunk 8.
I came across the same error.
command: sh run_gen_qwen2.5_7b_1M.sh 8 ./tmp
```bash
set -x

if [ "$#" -lt 2 ]; then
    echo "Usage: run_gen_qwen2.5_7b_1M.sh <nproc_per_node> <save_path> [other_configs...]"
    exit 1
fi

nproc_per_node=$1
save_path=$2

# Shift the arguments so $@ refers to the rest
shift 2

python3 -m verl.trainer.main_generation \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=$nproc_per_node \
    data.path=/mnt/disk2/train_data/test.parquet \
    data.prompt_key=prompt \
    data.n_samples=1 \
    data.output_path=$save_path \
    model.path=/infinity/models/Qwen2.5-7B-Instruct-1M \
    +model.trust_remote_code=True \
    rollout.temperature=1.0 \
    rollout.top_k=50 \
    rollout.top_p=0.7 \
    rollout.prompt_length=2048 \
    rollout.response_length=1024 \
    rollout.tensor_model_parallel_size=4 \
    rollout.gpu_memory_utilization=0.6
```
The simplest way to fix this issue is to manually add dummy data in the code, as sketched below.
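To make the dummy-data idea concrete, here is a plain-Python sketch of the pattern (pad the batch to a multiple of the chunk count before splitting, then drop the padded outputs). It mirrors the assertion in protocol.py but deliberately does not use verl's actual DataProto API, and newer verl versions may already ship padding helpers for this:

```python
# Plain-Python sketch of "add dummy data"; not verl's DataProto API.

def pad_to_multiple(batch, chunks):
    """Repeat the last element until len(batch) is divisible by `chunks`."""
    pad = (-len(batch)) % chunks
    return batch + [batch[-1]] * pad, pad

def equal_chunk(batch, chunks):
    """Equal-size split, mirroring the assertion in verl/protocol.py."""
    assert len(batch) % chunks == 0, "only support equal chunk"
    step = len(batch) // chunks
    return [batch[i * step:(i + 1) * step] for i in range(chunks)]

prompts = [f"prompt {i}" for i in range(39)]              # 39 rows, as in the log above
padded, pad = pad_to_multiple(prompts, chunks=4)          # 40 rows, 1 dummy
shards = equal_chunk(padded, chunks=4)                    # 4 shards of 10
outputs = [p.upper() for shard in shards for p in shard]  # stand-in for generation
outputs = outputs[:len(prompts)]                          # drop the dummy results
print(len(outputs))                                       # 39
```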
Fixed: the batch size (the number of rows in the prompt file) must be an integer multiple of n_gpus_per_node.