RuntimeError: Simulation crashed.
What is your question?
# With a dictionary, you tell Flower's VirtualClientEngine that each
# client needs exclusive access to these many resources in order to run
client_resources = {"num_cpus": 1, "num_gpus": 0.0}

# Let's disable tqdm progress bar in the main thread (used by the server)
disable_progress_bar()
I run the following code:
history = fl.simulation.start_simulation(
    client_fn=client_fn_callback, # a callback to construct a client
    num_clients=NUM_CLIENTS, # total number of clients in the experiment
    config=fl.server.ServerConfig(num_rounds=10), # let's run for 10 rounds
    strategy=strategy, # the strategy that will orchestrate the whole FL pipeline
    client_resources=client_resources,
    actor_kwargs={
        "on_actor_init_fn": disable_progress_bar # disable tqdm on each actor/process spawning virtual clients
    },
)
and it produces the following bug:
INFO flwr 2023-12-26 17:11:14,661 | app.py:178 | Starting Flower simulation, config: ServerConfig(num_rounds=10, round_timeout=None)
2023-12-26 17:11:17,056 WARNING utils.py:585 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-12-26 17:11:18,167 INFO worker.py:1621 -- Started a local Ray instance.
INFO flwr 2023-12-26 17:11:19,197 | app.py:213 | Flower VCE: Ray initialized with resources: {'CPU': 12.0, 'node:internal_head': 1.0, 'accelerator_type:G': 1.0, 'GPU': 1.0, 'object_store_memory': 27794835456.0, 'node:172.17.0.5': 1.0, 'memory': 55589670912.0}
INFO flwr 2023-12-26 17:11:19,199 | app.py:219 | Optimize your simulation with Flower VCE: https://flower.dev/docs/framework/how-to-run-simulations.html
INFO flwr 2023-12-26 17:11:19,200 | app.py:242 | Flower VCE: Resources for each Virtual Client: {'num_cpus': 1, 'num_gpus': 0.0}
INFO flwr 2023-12-26 17:11:19,268 | app.py:288 | Flower VCE: Creating VirtualClientEngineActorPool with 12 actors
INFO flwr 2023-12-26 17:11:19,270 | server.py:89 | Initializing global parameters
INFO flwr 2023-12-26 17:11:19,272 | server.py:276 | Requesting initial parameters from one random client
ERROR flwr 2023-12-26 17:11:24,866 | ray_client_proxy.py:145 | Traceback (most recent call last):
File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 138, in _submit_job
res = self.actor_pool.get_client_result(self.cid, timeout)
File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 414, in get_client_result
return self._fetch_future_result(cid)
File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 300, in _fetch_future_result
res_cid, res = ray.get(future) # type: (str, ClientRes)
File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/worker.py", line 2524, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ClientException): ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
return partitioner.load_partition(node_id)
File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'
The above exception was the direct cause of the following exception:
ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:
A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: 'RandomSampler' object has no attribute 'shard'\n',)
ERROR flwr 2023-12-26 17:11:24,868 | ray_client_proxy.py:146 | ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
  File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
    return partitioner.load_partition(node_id)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
    return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'
The above exception was the direct cause of the following exception:
ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:
A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: 'RandomSampler' object has no attribute 'shard'\n',)
ERROR flwr 2023-12-26 17:11:24,869 | app.py:313 | ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
  File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
    return partitioner.load_partition(node_id)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
    return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'
The above exception was the direct cause of the following exception:
ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:
A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: 'RandomSampler' object has no attribute 'shard'\n',)
ERROR flwr 2023-12-26 17:11:24,872 | app.py:314 | Traceback (most recent call last):
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/app.py", line 308, in start_simulation
    hist = run_fl(
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/app.py", line 225, in run_fl
    hist = server.fit(num_rounds=config.num_rounds, timeout=config.round_timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/server.py", line 90, in fit
    self.parameters = self._get_initial_parameters(timeout=timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/server.py", line 279, in _get_initial_parameters
    get_parameters_res = random_client.get_parameters(ins=ins, timeout=timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 180, in get_parameters
    res = self._submit_job(get_parameters, timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 147, in _submit_job
    raise ex
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py", line 138, in _submit_job
    res = self.actor_pool.get_client_result(self.cid, timeout)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 414, in get_client_result
    return self._fetch_future_result(cid)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 300, in _fetch_future_result
    res_cid, res = ray.get(future) # type: (str, ClientRes)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/worker.py", line 2524, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ClientException): ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
  File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
    return partitioner.load_partition(node_id)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
    return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'
The above exception was the direct cause of the following exception:
ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:
A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: 'RandomSampler' object has no attribute 'shard'\n',)
ERROR flwr 2023-12-26 17:11:24,873 | app.py:315 | Your simulation crashed :(. This could be because of several reasons.The most common are:
> Your system couldn't fit a single VirtualClient: try lowering client_resources.
> All the actors in your pool crashed. This could be because:
- You clients hit an out-of-memory (OOM) error and actors couldn't recover from it. Try launching your simulation with more generous client_resources setting (i.e. it seems {'num_cpus': 1, 'num_gpus': 0.0} is not enough for your workload). Use fewer concurrent actors.
- You were running a multi-node simulation and all worker nodes disconnected. The head node might still be alive but cannot accommodate any actor with resources: {'num_cpus': 1, 'num_gpus': 0.0}.
RayTaskError(ClientException)             Traceback (most recent call last)
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/app.py:308, in start_simulation(client_fn, num_clients, clients_ids, client_resources, server, config, strategy, client_manager, ray_init_args, keep_initialised, actor_type, actor_kwargs, actor_scheduling)
    306 try:
    307     # Start training
--> 308     hist = run_fl(
    309         server=initialized_server,
    310         config=initialized_config,
    311     )
    312 except Exception as ex:
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/app.py:225, in run_fl(server, config)
    224 """Train a model on the given server and return the History object."""
--> 225 hist = server.fit(num_rounds=config.num_rounds, timeout=config.round_timeout)
    226 log(INFO, "app_fit: losses_distributed %s", str(hist.losses_distributed))
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/server.py:90, in Server.fit(self, num_rounds, timeout)
     89 log(INFO, "Initializing global parameters")
---> 90 self.parameters = self._get_initial_parameters(timeout=timeout)
     91 log(INFO, "Evaluating initial parameters")
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/server/server.py:279, in Server._get_initial_parameters(self, timeout)
    278 ins = GetParametersIns(config={})
--> 279 get_parameters_res = random_client.get_parameters(ins=ins, timeout=timeout)
    280 log(INFO, "Received initial parameters from one random client")
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py:180, in RayActorClientProxy.get_parameters(self, ins, timeout)
    175 return maybe_call_get_parameters(
    176     client=client,
    177     get_parameters_ins=ins,
    178 )
--> 180 res = self._submit_job(get_parameters, timeout)
    182 return cast(
    183     common.GetParametersRes,
    184     res,
    185 )
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py:147, in RayActorClientProxy._submit_job(self, job_fn, timeout)
    146 log(ERROR, ex)
--> 147 raise ex
    149 return res
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py:138, in RayActorClientProxy._submit_job(self, job_fn, timeout)
    134 self.actor_pool.submit_client_job(
    135     lambda a, c_fn, j_fn, cid: a.run.remote(c_fn, j_fn, cid),
    136     (self.client_fn, job_fn, self.cid),
    137 )
--> 138 res = self.actor_pool.get_client_result(self.cid, timeout)
    140 except Exception as ex:
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py:414, in VirtualClientEngineActorPool.get_client_result(self, cid, timeout)
    413 # Fetch result belonging to the VirtualClient calling this method
--> 414 return self._fetch_future_result(cid)
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py:300, in VirtualClientEngineActorPool._fetch_future_result(self, cid)
    299 future: ObjectRef[Any] = self._cid_to_future[cid]["future"] # type: ignore
--> 300 res_cid, res = ray.get(future) # type: (str, ClientRes)
    301 except ray.exceptions.RayActorError as ex:
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/auto_init_hook.py:24, in wrap_auto_init.
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:103, in client_mode_hook.
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/ray/_private/worker.py:2524, in get(object_refs, timeout)
   2523 if isinstance(value, RayTaskError):
-> 2524     raise value.as_instanceof_cause()
   2525 else:
RayTaskError(ClientException): ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn
  File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition
    return partitioner.load_partition(node_id)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition
    return self.dataset.shard(
AttributeError: 'RandomSampler' object has no attribute 'shard'
The above exception was the direct cause of the following exception:
ray::DefaultActor.run() (pid=15387, ip=172.17.0.5, actor_id=ba3372c4e219f3601da6569901000000, repr=<flwr.simulation.ray_transport.ray_actor.DefaultActor object at 0x7f8398162fe0>)
  File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 84, in run
    raise ClientException(str(message)) from ex
flwr.simulation.ray_transport.ray_actor.ClientException:
A ClientException occurred.('\n\tSomething went wrong when running your client workload.\n\tClient 66 crashed when the DefaultActor was running its workload.\n\tException triggered on the client side: Traceback (most recent call last):\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 70, in run\n client = check_clientfn_returns_client(client_fn(cid))\n File "/tmp/ipykernel_13261/2759444332.py", line 15, in client_fn\n File "/root/autodl-tmp/flower/examples/simulation-pytorch/fl_dataset.py", line 53, in load_partition\n return partitioner.load_partition(node_id)\n File "/root/miniconda3/envs/flower/lib/python3.10/site-packages/flwr_datasets/partitioner/iid_partitioner.py", line 50, in load_partition\n return self.dataset.shard(\nAttributeError: 'RandomSampler' object has no attribute 'shard'\n',)
The above exception was the direct cause of the following exception:
RuntimeError                              Traceback (most recent call last)
Cell In[20], line 8
      5 # Let's disable tqdm progress bar in the main thread (used by the server)
      6 disable_progress_bar()
----> 8 history = fl.simulation.start_simulation(
      9     client_fn=client_fn_callback, # a callback to construct a client
     10     num_clients=NUM_CLIENTS, # total number of clients in the experiment
     11     config=fl.server.ServerConfig(num_rounds=10), # let's run for 10 rounds
     12     strategy=strategy, # the strategy that will orchestrate the whole FL pipeline
     13     client_resources=client_resources,
     14     actor_kwargs={
     15         "on_actor_init_fn": disable_progress_bar # disable tqdm on each actor/process spawning virtual clients
     16     },
     17 )
File ~/miniconda3/envs/flower/lib/python3.10/site-packages/flwr/simulation/app.py:332, in start_simulation(client_fn, num_clients, clients_ids, client_resources, server, config, strategy, client_manager, ray_init_args, keep_initialised, actor_type, actor_kwargs, actor_scheduling)
    314 log(ERROR, traceback.format_exc())
    315 log(
    316     ERROR,
    317     "Your simulation crashed :(. This could be because of several reasons."
    (...)
    330     client_resources,
    331 )
--> 332 raise RuntimeError("Simulation crashed.") from ex
    334 finally:
    335     # Stop time monitoring resources in cluster
    336     f_stop.set()
RuntimeError: Simulation crashed.
How can I solve this problem?
Hi, it seems the issue comes from the data partitioning process. Could you share the related code where you use Flower Datasets?
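In the meantime, here is what the traceback points to: IidPartitioner.load_partition() calls self.dataset.shard(...), and in your run self.dataset is a PyTorch RandomSampler. That suggests fl_dataset.py assigns a torch sampler/dataloader object to the partitioner, whereas the partitioner only works on a Hugging Face datasets.Dataset (that is what has .shard()). Below is a minimal sketch of the pattern it expects; the dataset name, NUM_CLIENTS, and the load_partition helper are placeholders, since we have not seen your fl_dataset.py, so adapt them to your code:

# Rough sketch only -- dataset name, NUM_CLIENTS and the helper below are
# assumptions, not your actual fl_dataset.py.
from datasets import load_dataset  # Hugging Face datasets
from flwr_datasets.partitioner import IidPartitioner
from torch.utils.data import DataLoader

NUM_CLIENTS = 100  # assumed to match what you pass to start_simulation

# IidPartitioner.load_partition() calls dataset.shard(...), so the partitioner
# must be fed a Hugging Face Dataset, not a torch Dataset/Sampler/DataLoader.
train_set = load_dataset("cifar10", split="train")

partitioner = IidPartitioner(num_partitions=NUM_CLIENTS)
partitioner.dataset = train_set

def load_partition(node_id: int):
    """Return one client's shard as a Hugging Face Dataset."""
    return partitioner.load_partition(node_id)

# Inside client_fn you can then turn a shard into a PyTorch DataLoader:
partition = load_partition(0).with_format("torch")
train_loader = DataLoader(partition, batch_size=32, shuffle=True)

Alternatively, FederatedDataset (as used in the simulation-pytorch example) downloads the dataset and wires it to the partitioner for you, so you never touch partitioner.dataset directly.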
Same problem, could you help me?