Parallelization

Open kapsl opened this issue 6 years ago • 3 comments

Hi, as far as I understand it, SAC currently works for training with a single agent?

Are there plans to support distributed training, as is done in Surreal?

kapsl, Dec 18 '18 08:12

You're right, SAC currently works only for single-agent training. Unfortunately, we don't have plans to support distributed training, at least in the near future. I'm planning to contribute the SAC algorithm to Ray at some point, which might be a good place to introduce some parallelization of the training, but I can't guarantee anything.

hartikainen, Dec 19 '18 01:12

Hi.

I'm trying to run the code on a cluster computer, but it shows these messages and doesn't work:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0/0 GPUs
Memory usage on this node: 73.5/270.3 GB
Result logdir: /home/babadia1/ray_results/gym/HalfCheetah/v3/2019-08-24T11-33-28-my-sac-experiment-1
Number of trials: 1 ({'ERROR': 1})
ERROR trials:
  • id=d91b34f4-seed=5141: ERROR, 4 failures: /home/babadia1/ray_results/gym/HalfCheetah/v3/2019-08-24T11-33-28-my-sac-experiment-1/id=d91b34f4-seed=5141_2019-08-24_11-33-28woxscczk/error_2019-08-24_11-33-53.txt

Traceback (most recent call last):
  File "/home/babadia1/.conda/envs/sac/bin/softlearning", line 11, in <module>
    load_entry_point('softlearning', 'console_scripts', 'softlearning')()
  File "/scratch/work/babadia1/MotionChunking/softlearning/softlearning/scripts/console_scripts.py", line 202, in main
    return cli()
  File "/home/babadia1/.conda/envs/sac/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/babadia1/.conda/envs/sac/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/babadia1/.conda/envs/sac/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/babadia1/.conda/envs/sac/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/babadia1/.conda/envs/sac/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/scratch/work/babadia1/MotionChunking/softlearning/softlearning/scripts/console_scripts.py", line 71, in run_example_local_cmd
    return run_example_local(example_module_name, example_argv)
  File "/scratch/work/babadia1/MotionChunking/softlearning/examples/instrument.py", line 224, in run_example_local
    reuse_actors=True)
  File "/home/babadia1/.conda/envs/sac/lib/python3.6/site-packages/ray/tune/tune.py", line 262, in run
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [id=d91b34f4-seed=5141])

I guess it's because of the multiprocessing? Is there any way to disable that so it just runs as a single process? Or is it some other issue?

donamin, Aug 24 '19 08:08

Hi, I'm also hitting the same error. How can I fix it?

Amanda2024, Sep 23 '20 01:09
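
A note for anyone debugging the error above: the `TuneError` in the traceback only says that a trial did not complete. The underlying exception is probably in the per-trial `error_….txt` file that the status output lists under "ERROR trials". Below is a minimal Python sketch, not part of softlearning or Ray, that pulls those paths out of a pasted status dump and shows the tail of each file. The helper names (`find_error_files`, `tail`, `report`) and the path pattern are assumptions based only on the log format shown in this thread.

```python
import re
from pathlib import Path
from typing import List

# Assumed pattern: absolute paths ending in error_<timestamp>.txt,
# as they appear in the "ERROR trials" section of the status dump above.
ERROR_FILE_PATTERN = re.compile(r"(/\S+/error_[\w-]+\.txt)")


def find_error_files(status_text: str) -> List[str]:
    """Return the error-file paths mentioned in a pasted status dump."""
    return ERROR_FILE_PATTERN.findall(status_text)


def tail(path: str, n_lines: int = 30) -> str:
    """Return the last n_lines of a text file, or a note if it is missing."""
    p = Path(path)
    if not p.is_file():
        return f"(not found: {path})"
    lines = p.read_text(errors="replace").splitlines()
    return "\n".join(lines[-n_lines:])


def report(status_text: str, n_lines: int = 30) -> str:
    """Build a short report with the tail of every error file found."""
    sections = []
    for path in find_error_files(status_text):
        sections.append(f"== {path} ==\n{tail(path, n_lines)}")
    return "\n\n".join(sections)
```

To use it, save the console output to a file (for example a hypothetical `tune_status.txt`) and run `print(report(open("tune_status.txt").read()))` on the machine where the trial ran. The tail of the error file usually names the exception that actually stopped the trial, which should help tell a multiprocessing problem apart from something else.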