DI-engine
CPU utilization problem
- [ ] I have marked all applicable categories:
- [ ] exception-raising bug
- [ ] RL algorithm bug
- [x] system worker bug
- [x] system utils bug
- [ ] code design/refactor
- [ ] documentation request
- [ ] new feature request
- [x] I have visited the readme and doc
- [x] I have searched through the issue tracker and pr tracker
- [x] I have mentioned version numbers, operating system and environment, where applicable:
# ding version `v0.2.0`, linux platform
Issue Description
CPU utilization is very low, nowhere near 100% (below 5% on average).
Steps to Reproduce
1. Clone the repo and `git checkout main` (currently on 0fcfdf26).
2. Run `python3 dizoo/slime_volley/entry/slime_volley_selfplay_ppo_main.py`.
3. Open `htop` to check CPU usage. Only one core is occupied on a multi-core machine.
What Do We Need?
During training, run the command `mpstat 3`. The `%idle` column should be below 20% (its current value is 97%).
Conclusion: the `SlimeVolley-v0` env is too tiny to fully utilize the CPU during training.
You can try this test file for the env: run `pytest -sv .` and open `htop` in another terminal to check the usage:
```python
import time

import pytest
import numpy as np
from easydict import EasyDict
from dizoo.slime_volley.envs.slime_volley_env import SlimeVolleyEnv


@pytest.mark.envtest
class TestSlimeVolley:

    @pytest.mark.parametrize('agent_vs_agent', [True, False])
    def test_slime_volley(self, agent_vs_agent):
        total_rew = 0
        env = SlimeVolleyEnv(EasyDict({'env_id': 'SlimeVolley-v0', 'agent_vs_agent': agent_vs_agent}))
        # env.enable_save_replay('replay_video')
        obs1 = env.reset()
        print(env._env.observation_space)
        print('observation is like:', obs1)
        done = False
        while not done:
            if agent_vs_agent:
                action1 = np.random.randint(0, 2, (1, ))
                action2 = np.random.randint(0, 2, (1, ))
                action = [action1, action2]
            else:
                action = np.random.randint(0, 2, (1, ))
            time.sleep(0.01)
            observations, rewards, done, infos = env.step(action)
            total_rew += rewards[0]
            obs1, obs2 = observations[0], observations[1]
            assert obs1.shape == obs2.shape, (obs1.shape, obs2.shape)
            if agent_vs_agent:
                agent_lives, opponent_lives = infos[0]['ale.lives'], infos[1]['ale.lives']
        if agent_vs_agent:
            assert agent_lives == 0 or opponent_lives == 0, (agent_lives, opponent_lives)
        print("total reward is:", total_rew)
```
If you run this file directly, you will find CPU usage is just ~5% like the following screenshot:

But if you comment out the `time.sleep(0.01)` line, you will see ~100% CPU usage.
Note that the RL pipeline for collecting data usually looks like this:

```python
while True:
    action = policy.forward(obs)
    obs, rew, done, info = env.step(action)
    ...
```

The env and the policy are called alternately. If the env is too tiny and a single policy forward takes on the order of 0.01s or more, CPU usage will be very low, as in your case, so this is not the fault of DI-engine's `SyncSubprocessEnvManager`.
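A quick way to see this split is to time the two halves of the loop separately. Below is a minimal, self-contained sketch using the old gym API; `fake_policy_forward` is a made-up stand-in for real model inference:

```python
import time
import gym

# Time how much of a collect loop is spent in the "policy" versus the env.
env = gym.make('CartPole-v1')
obs = env.reset()


def fake_policy_forward(obs):
    # Stand-in for model inference; replace with your real policy.
    time.sleep(0.01)
    return env.action_space.sample()


policy_time, env_time = 0.0, 0.0
for _ in range(500):
    t0 = time.time()
    action = fake_policy_forward(obs)
    t1 = time.time()
    obs, rew, done, info = env.step(action)
    t2 = time.time()
    policy_time += t1 - t0
    env_time += t2 - t1
    if done:
        obs = env.reset()

# If the policy (or any sleep) dominates and the env step is tiny, one core
# spends most of its time waiting and overall CPU usage stays low.
print(f"policy: {policy_time:.2f}s, env: {env_time:.2f}s")
```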
I would like to know your plan or target for training speed; CPU usage by itself is not a good metric. Maybe other viewpoints can help you.
Plan/Target
Renting a 64-core machine is not cheap. The overall goal is not to waste any CPU resources (no core should sit idle) and hence to make convergence faster.
First, I think you misunderstood the issue. This is not about whether a single core is at 100%, but about whether all cores are at 100%.
> But if you comment out the `time.sleep(0.01)` line, you will see ~100% CPU usage.
I commented out the `time.sleep(0.01)` and saw 1 core at 100%. However, the remaining 63 cores are still near 0%.
Second, there are several issues here. Let me break them down a little bit.
1. Speed Benchmark
1.1 SubprocVecEnv Speed
Below is an example using `SubprocVecEnv` from stable-baselines3 to train on `CartPole-v1`. The result is that the multi-core version trained ~30x faster.
Step 1: Run `pip3 install stable-baselines3[extra]`.
Step 2: Create a file `main.py` with the content below and run `python3 main.py`:
```python
import time
import multiprocessing
from typing import Callable

import gym
import numpy as np
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.env_util import make_vec_env


def make_env(env_id: str, rank: int, seed: int = 0) -> Callable:
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param num_env: (int) the number of environments you wish to have in subprocesses
    :param seed: (int) the initial seed for RNG
    :param rank: (int) index of the subprocess
    :return: (Callable)
    """

    def _init() -> gym.Env:
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env

    set_random_seed(seed)
    return _init


env_id = "CartPole-v1"
num_cpu = multiprocessing.cpu_count()  # Number of processes to use

if __name__ == '__main__':
    # Create the vectorized environment
    env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])
    model = A2C('MlpPolicy', env, verbose=0)
    # We create a separate environment for evaluation
    eval_env = gym.make(env_id)
    n_timesteps = 25000

    # Multiprocessed RL Training
    start_time = time.time()
    model.learn(n_timesteps)
    total_time_multi = time.time() - start_time
    print(f"Took {total_time_multi:.2f}s for multiprocessed version - {n_timesteps / total_time_multi:.2f} FPS")

    # Single Process RL Training
    single_process_model = A2C('MlpPolicy', env_id, verbose=0)
    start_time = time.time()
    single_process_model.learn(n_timesteps)
    total_time_single = time.time() - start_time
    print(f"Took {total_time_single:.2f}s for single process version - {n_timesteps / total_time_single:.2f} FPS")

    print("Multiprocessed training is {:.2f}x faster!".format(total_time_single / total_time_multi))
```
Output on my machine:

```
Took 1.96s for multiprocessed version - 12777.25 FPS
Took 59.70s for single process version - 418.80 FPS
Multiprocessed training is 30.51x faster!
```
1.2 SampleFactory Speed
Sample Factory is another RL framework. You can follow the steps below to reproduce the FPS for `CartPole-v1`.
Step 1: Run

```
git clone https://github.com/alex-petrenko/sample-factory.git
cd sample-factory
conda env create -f environment.yml
conda activate sample-factory
```

(Don't use pip here, as pip might not work.)

Step 2: Run

```
python -m sample_factory_examples.train_gym_env --algo=APPO --use_rnn=False --num_envs_per_worker=20 --policy_workers_per_policy=2 --recurrence=1 --with_vtrace=False --batch_size=512 --hidden_size=256 --encoder_type=mlp --encoder_subtype=mlp_mujoco --reward_scale=0.1 --save_every_sec=10 --experiment_summaries_interval=10 --experiment=example_gym_cartpole-v1 --env=gym_CartPole-v1
```
Output on my machine:

```
FPS is (10 sec: 42758.9, 60 sec: 43001.2, 300 sec: 43001.2). Total num frames: 1420800.
Throughput: 0: 47068.9. Samples: 1388960. Policy #0 lag: (min: 6.0, avg: 51.4, max: 105.0)
Avg episode reward: [(0, '483.310')]
```

which means the FPS is 43001.2.
1.3 DI-engine Speed
Running `ding -m serial -c cartpole_a2c_config.py -s 0` gives me something like `avg_envstep_per_sec: 2514.50`, which means the FPS is 2514.50.
All of the above results were generated on the same machine, running the same env, `CartPole-v1`. As we can see, DI-engine is about 17 times slower than SampleFactory. This should be fixed ASAP. Updating `SyncSubprocessEnvManager` along the lines of `SubprocVecEnv` would be a good start. After that, we can see what we can learn from SampleFactory.
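For reference, the speed gaps discussed in this thread follow directly from the three FPS numbers above; a quick sanity check:

```python
# FPS numbers reported above, all measured on the same machine with CartPole-v1.
sample_factory_fps = 43001.2   # SampleFactory (APPO)
sb3_subproc_fps = 12777.25     # stable-baselines3 SubprocVecEnv (A2C)
di_engine_fps = 2514.50        # DI-engine serial pipeline (A2C)

print(f"vs SampleFactory: {sample_factory_fps / di_engine_fps:.1f}x slower")  # ~17.1x
print(f"vs SubprocVecEnv: {sb3_subproc_fps / di_engine_fps:.1f}x slower")     # ~5.1x
```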
Our efficiency experiments on Atari/MuJoCo environments are comparable with stable-baselines3/tianshou, and our mock test benchmark results are shown in this link.
As for your test results in the `cartpole` env, I will dig deeper to find out what causes the FPS differences.
`SampleFactory` is indeed the highest-throughput example in the APPO setting right now. Running the collector (actor) and learner asynchronously is a key point, while our `serial_pipeline` is a serial demo. We plan to add a parallel training demo for Atari/MuJoCo/SMAC in December.
For the env manager (vec env), Python multiprocessing and threading are not a good choice (due to the GIL and Python IPC). I talked about this problem with tianshou's author last week, and we think a new env manager based on a C++ thread pool is necessary; this will be an important new feature in November/December.
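To illustrate the GIL part of that claim, here is a standalone toy benchmark (unrelated to DI-engine's code): a CPU-bound, pure-Python "env step" does not get faster when spread over threads, because only one thread can hold the GIL at a time.

```python
import time
import threading


def cpu_bound_step(n=2_000_000):
    # Stand-in for a pure-Python env step: spins the CPU and holds the GIL.
    s = 0
    for i in range(n):
        s += i
    return s


# One thread.
t0 = time.time()
cpu_bound_step()
single = time.time() - t0

# Four threads "in parallel": the GIL serializes them, so the wall-clock time
# is roughly 4x the single-thread time, not the same. A Python thread pool
# therefore cannot scale an env manager for CPU-bound envs.
threads = [threading.Thread(target=cpu_bound_step) for _ in range(4)]
t0 = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
multi = time.time() - t0

print(f"1 thread: {single:.2f}s, 4 threads: {multi:.2f}s")
```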
Thanks for the information.
> Our efficiency experiments on Atari/MuJoCo environments are comparable with stable-baselines3/tianshou

This is good, but it only tests the collector. We should test the throughput of the whole training pipeline end to end. This could be done sometime next year.
> For the env manager (vec env), Python multiprocessing and threading are not a good choice

Could we please at least match `SubprocVecEnv`'s performance by the end of October? We are still 5 times slower than `SubprocVecEnv`, which is itself Python, not C++.
> our serial_pipeline is a serial demo

Agreed that we need asynchrony, which could be achieved by starting two `multiprocessing.Process` instances. I wish we could also have an MVP (minimum viable product) demo for `slime_volley` that replaces the low-performance `serial_pipeline` by the end of October.
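To make that suggestion concrete, here is a rough sketch of the kind of MVP meant here: a collector process and a learner process decoupled by a bounded queue, so neither side waits for the other. Every function name below is a stand-in invented for illustration; none of this is DI-engine API.

```python
import time
import queue as queue_lib
import multiprocessing as mp


def collect_one_batch():
    # Stand-in for rolling out the envs with the current policy.
    time.sleep(0.01)
    return [0.0] * 64


def update_model(batch):
    # Stand-in for one optimizer step on the learner side.
    time.sleep(0.02)


def collector(data_queue, stop_event):
    # Produce data until the learner signals that it has finished.
    while not stop_event.is_set():
        batch = collect_one_batch()
        try:
            data_queue.put(batch, timeout=1)
        except queue_lib.Full:
            continue
    data_queue.cancel_join_thread()  # don't block process exit on unread items


def learner(data_queue, stop_event, total_iters):
    # Consume batches as they arrive; blocks only if the collector falls behind.
    for _ in range(total_iters):
        update_model(data_queue.get())
    stop_event.set()


if __name__ == '__main__':
    data_queue = mp.Queue(maxsize=8)  # bounded queue keeps the two sides a few steps apart
    stop_event = mp.Event()
    procs = [
        mp.Process(target=collector, args=(data_queue, stop_event)),
        mp.Process(target=learner, args=(data_queue, stop_event, 100)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```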
Hello, this is a serious system design problem, which may require a fundamental refactor of the framework. It needs people familiar with framework and system design to work on it.
Could you let us know the current progress on this? @PaParaZz1
> Hello, this is a serious system design problem, which may require a fundamental refactor of the framework. It needs people familiar with framework and system design to work on it. Could you let us know the current progress on this? @PaParaZz1
We are preparing a new design for the main entry function, and new benchmark results on cartpole will be posted on 11.8.
One open-source framework released last week showed that EnvPool can achieve even a 2x improvement over SampleFactory.
@PaParaZz1 Could you let us know the latest FPS results of DI-engine please 😔
> One open-source framework released last week showed that EnvPool can achieve even a 2x improvement over SampleFactory. @PaParaZz1 Could you let us know the latest FPS results of DI-engine please 😔
`EnvPool` is the C++ env manager I mentioned before, and we are in the middle of an ambitious refactor of the main entry, so we have delayed the detailed asynchronous pipeline benchmark to the end of November. Besides, we will add a new env manager option (DI-engine + EnvPool) this week, but `EnvPool` only supports Atari for now.
@PaParaZz1 Hello there, it has been 3 months. We have to ask if there is any update on this ticket.
Hi @zxzzz0, glad to see you are still following this issue. In fact, we have made some important improvements over the past three months, including process architecture optimization and an objective horizontal comparison. First, please take a look at some of our test results:

All test code, reports and TensorBoard screenshots will be open-sourced on GitHub in the future, so you can try it yourself.
Next, I will try to explain the problems you found and our solutions (which will land in the 1.0 version).
RL efficiency is divided into two parts: environment collection speed and training speed. We care about maximizing the overall speed, because if the environment is collected non-stop (using the entire CPU) but the training speed cannot keep up, it only produces a large amount of data that cannot be used effectively, and vice versa.
Therefore, we mainly solve this problem from two aspects. The first is to make collection and training asynchronous, so that the model does not wait for the environment's data and the environment does not wait for the model to update; the two can maintain a certain step difference. In fact, not only the environment and training: most of the steps within the entire pipeline can be made asynchronous. As far as I know, this is not well supported in most frameworks.
The second point is to reduce the waste of resources caused by synchronization between the parent and child processes, which is what you described: the CPU not being fully used during collection. In fact, if you watch the `top` command, the env manager does not keep the whole machine's CPU busy all the time; there are periodic fluctuations, because a single parent process is used to gather all the data from the child processes. We will solve this bottleneck in v1.0; the upper/lower comparison charts below show the difference in CPU usage between the new execution mode (bottom of the picture) and the previous execution mode (top of the picture).
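As a toy illustration of that parent-process bottleneck (stand-in code, not DI-engine's actual env manager), the pattern below is roughly what a synchronous subprocess env manager does: one parent dispatches actions one by one and then waits for the slowest child at every step, so the children idle whenever the parent is busy and CPU usage fluctuates instead of staying high.

```python
import time
import random
import multiprocessing as mp


def env_worker(conn):
    # Stand-in child process: receive an action, simulate a tiny env step, reply.
    while True:
        action = conn.recv()
        if action is None:
            break
        time.sleep(random.uniform(0.001, 0.003))  # per-step cost varies per child
        conn.send((0.0, 0.0, False, {}))


if __name__ == '__main__':
    n_envs = 4
    parent_conns, procs = [], []
    for _ in range(n_envs):
        parent_conn, child_conn = mp.Pipe()
        p = mp.Process(target=env_worker, args=(child_conn, ))
        p.start()
        parent_conns.append(parent_conn)
        procs.append(p)

    for step in range(100):
        for conn in parent_conns:
            conn.send(step)  # the parent dispatches actions sequentially
        results = [conn.recv() for conn in parent_conns]  # then waits for the slowest child

    for conn in parent_conns:
        conn.send(None)
    for p in procs:
        p.join()
```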

I hope these explanations restore your confidence in DI-engine and help you in your future work. Keep following us, thank you.
Best.