
singleagent_merge training fails

Open · soroushasri opened this issue · 1 comment

When I run "python examples/train.py singleagent_merge", RLlib keeps failing during training. The error is:

2020-05-05 17:50:40,177 ERROR trial_runner.py:521 -- Trial PPO_MergePOEnv-v0_00000: Error processing event.
Traceback (most recent call last):
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/worker.py", line 1504, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::PPO.train() (pid=21386, ip=192.168.44.135)
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 414, in ray._raylet.execute_task.function_executor
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 495, in train
    raise e
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 484, in train
    result = Trainable.train(self)
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/tune/trainable.py", line 261, in train
    result = self._train()
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 151, in _train
    fetches = self.optimizer.step()
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 148, in step
    self.train_batch_size)
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/optimizers/rollout.py", line 25, in collect_samples
    next_sample = ray_get_and_free(fut_sample)
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/utils/memory.py", line 32, in ray_get_and_free
    return ray.get(object_ids)
ray.exceptions.RayTaskError(ValueError): ray::RolloutWorker.sample() (pid=21495, ip=192.168.44.135)
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 414, in ray._raylet.execute_task.function_executor
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 510, in sample
    batches = [self.input_reader.next()]
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 54, in next
    batches = [self.get_data()]
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 98, in get_data
    item = next(self.rollout_provider)
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 349, in _env_runner
    callbacks, soft_horizon, no_done_at_end)
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 444, in _process_observations
    policy_id).transform(raw_obs)
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/models/preprocessors.py", line 162, in transform
    self.check_shape(observation)
  File "/home/soroush/anaconda3/envs/flow/lib/python3.6/site-packages/ray/rllib/models/preprocessors.py", line 61, in check_shape
    self._obs_space, observation)
ValueError: ('Observation outside expected value range', Box(25,), array([
    0.29654689,  0.23655217,  0.05186453,  0.05527995,  0.01125585,
    0.19895927,  0.3223683 ,  0.29241097,  0.05537732,  0.01002027,
    0.1181798 , -0.01068187,  0.01191296,  0.01008524,  0.00615735,
    0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
    0.        ,  0.        ,  0.        ,  0.        ,  0.        ]))

2020-05-05 17:50:40,180 INFO trial_runner.py:636 -- Trial PPO_MergePOEnv-v0_00000: Attempting to restore trial state from last checkpoint.

== Status ==
Memory usage on this node: 3.4/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 3/3 CPUs, 0/0 GPUs, 0.0/9.38 GiB heap, 0.0/0.1 GiB objects
Result logdir: /home/soroush/ray_results/stabilizing_open_network_merges
Number of trials: 1 (1 RUNNING)
+-------------------------+----------+-------+

soroushasri · May 05 '20 13:05
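A note on the error: the ValueError at the bottom of the traceback is the actual failure. RLlib's preprocessor checks every observation against the observation space the environment declares, and one entry in the array above (-0.01068187) falls slightly below the lower bound of the Box(25,) space (MergePOEnv normalizes its state, and the declared range is presumably [0, 1]). Below is a minimal sketch of a workaround, not Flow's official fix: clip the observation into the declared bounds before it reaches RLlib. The wrapper name ClipObservation is hypothetical, and it assumes a Gym-style env object can be wrapped before registration.

import gym
import numpy as np


class ClipObservation(gym.ObservationWrapper):
    """Clamp each observation into the [low, high] bounds of the env's
    declared Box space, so RLlib's check_shape() no longer raises
    'Observation outside expected value range'."""

    def observation(self, obs):
        # np.clip broadcasts against the per-dimension bound arrays.
        return np.clip(obs,
                       self.observation_space.low,
                       self.observation_space.high)

Alternatively, the same clipping can be applied at the end of the environment's get_state() in Flow's source. Either way, the point is to keep the returned state inside the bounds the environment advertises; the small negative entry itself likely comes from the state normalization producing a value just below zero.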

I also fail to train singleagent_merge; multiagent merge fails too... QAQ

SHITIANYU-hue · Jun 05 '20 23:06