DI-engine
CPU utilization problem
- [ ] I have marked all applicable categories:
- [ ] exception-raising bug
- [ ] RL algorithm bug
- [x] system worker bug
- [x] system utils bug
- [ ] code design/refactor
- [ ] documentation request
- [ ] new feature request
- [x] I have visited the readme and doc
- [x] I have searched through the issue tracker and pr tracker
- [x] I have mentioned version numbers, operating system and environment, where applicable:
# ding version `v0.2.0`, linux platform
Issue Description
CPU utilization is very low, nowhere near 100% (below 5% on average).
Steps to Reproduce
1. Clone the repo and `git checkout main` (currently on 0fcfdf26).
2. Run `python3 dizoo/slime_volley/entry/slime_volley_selfplay_ppo_main.py`.
3. Open `htop` to check CPU usage. Only one core is occupied on a multi-core machine.
What Do We Need?
During training, run the command `mpstat 3`. The `%idle` column should be below 20% (its current value is 97%).
Conclusion: the `SlimeVolley-v0` env is too tiny to fully utilize the CPU during training.
You can try this test file for the env: run `pytest -sv .` and open `htop` in another terminal to check the usage:
```python
import time

import pytest
import numpy as np
from easydict import EasyDict
from dizoo.slime_volley.envs.slime_volley_env import SlimeVolleyEnv


@pytest.mark.envtest
class TestSlimeVolley:

    @pytest.mark.parametrize('agent_vs_agent', [True, False])
    def test_slime_volley(self, agent_vs_agent):
        total_rew = 0
        env = SlimeVolleyEnv(EasyDict({'env_id': 'SlimeVolley-v0', 'agent_vs_agent': agent_vs_agent}))
        # env.enable_save_replay('replay_video')
        obs1 = env.reset()
        print(env._env.observation_space)
        print('observation is like:', obs1)
        done = False
        while not done:
            if agent_vs_agent:
                action1 = np.random.randint(0, 2, (1, ))
                action2 = np.random.randint(0, 2, (1, ))
                action = [action1, action2]
            else:
                action = np.random.randint(0, 2, (1, ))
            time.sleep(0.01)
            observations, rewards, done, infos = env.step(action)
            total_rew += rewards[0]
            obs1, obs2 = observations[0], observations[1]
            assert obs1.shape == obs2.shape, (obs1.shape, obs2.shape)
            if agent_vs_agent:
                agent_lives, opponent_lives = infos[0]['ale.lives'], infos[1]['ale.lives']
        if agent_vs_agent:
            assert agent_lives == 0 or opponent_lives == 0, (agent_lives, opponent_lives)
        print("total reward is:", total_rew)
```
If you run this file directly, you will find CPU usage is just ~5% like the following screenshot:

But if you comment out the `time.sleep(0.01)` line, you will see ~100% CPU usage.
Note that the RL pipeline for collecting data usually looks like this:

```python
while True:
    action = policy.forward(obs)
    obs, rew, done, info = env.step(action)
    ...
```

The env and the policy are called alternately. If the env is too tiny and a single policy forward takes on the order of 0.01s or more, CPU usage will be very low, as in your case, so this is not the fault of DI-engine's `SyncSubprocessEnvManager`.
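A quick way to see this split is to time the two halves of the loop separately. Below is a minimal, self-contained sketch using the old gym API; `fake_policy_forward` is a made-up stand-in for real model inference:

```python
import time
import gym

# Time how much of a collect loop is spent in the "policy" versus the env.
env = gym.make('CartPole-v1')
obs = env.reset()


def fake_policy_forward(obs):
    # Stand-in for model inference; replace with your real policy.
    time.sleep(0.01)
    return env.action_space.sample()


policy_time, env_time = 0.0, 0.0
for _ in range(500):
    t0 = time.time()
    action = fake_policy_forward(obs)
    t1 = time.time()
    obs, rew, done, info = env.step(action)
    t2 = time.time()
    policy_time += t1 - t0
    env_time += t2 - t1
    if done:
        obs = env.reset()

# If the policy (or any sleep) dominates and the env step is tiny, one core
# spends most of its time waiting and overall CPU usage stays low.
print(f"policy: {policy_time:.2f}s, env: {env_time:.2f}s")
```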
I would like to know your plan or target for training speed; CPU usage by itself is not a good metric. Maybe other viewpoints can help you.
Plan/Target
Renting a 64-core machine is not cheap. The overall goal is not to waste any CPU resources (no core should sit idle) and hence to make convergence faster.
First, I think you misunderstood the issue. This is not about whether a single core is at 100%, but about whether all cores are at 100%.
> But if you comment out the `time.sleep(0.01)` line, you will see ~100% CPU usage.
I commented out the `time.sleep(0.01)` and saw 1 core at 100%. However, the remaining 63 cores are still near 0%.
Second, there are several issues here. Let me break them down a little bit.
1. Speed Benchmark
1.1 SubprocVecEnv Speed
Below is an example using `SubprocVecEnv` from stable-baselines3 to train on `CartPole-v1`. The result is that the multi-core version trained ~30x faster.
Step 1: Run `pip3 install stable-baselines3[extra]`.
Step 2: Create a file `main.py` with the content below and run `python3 main.py`:
```python
import time
import multiprocessing
from typing import Callable

import gym
import numpy as np
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.env_util import make_vec_env


def make_env(env_id: str, rank: int, seed: int = 0) -> Callable:
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param num_env: (int) the number of environments you wish to have in subprocesses
    :param seed: (int) the initial seed for RNG
    :param rank: (int) index of the subprocess
    :return: (Callable)
    """

    def _init() -> gym.Env:
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env

    set_random_seed(seed)
    return _init


env_id = "CartPole-v1"
num_cpu = multiprocessing.cpu_count()  # Number of processes to use

if __name__ == '__main__':
    # Create the vectorized environment
    env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])
    model = A2C('MlpPolicy', env, verbose=0)
    # We create a separate environment for evaluation
    eval_env = gym.make(env_id)
    n_timesteps = 25000

    # Multiprocessed RL Training
    start_time = time.time()
    model.learn(n_timesteps)
    total_time_multi = time.time() - start_time
    print(f"Took {total_time_multi:.2f}s for multiprocessed version - {n_timesteps / total_time_multi:.2f} FPS")

    # Single Process RL Training
    single_process_model = A2C('MlpPolicy', env_id, verbose=0)
    start_time = time.time()
    single_process_model.learn(n_timesteps)
    total_time_single = time.time() - start_time
    print(f"Took {total_time_single:.2f}s for single process version - {n_timesteps / total_time_single:.2f} FPS")

    print("Multiprocessed training is {:.2f}x faster!".format(total_time_single / total_time_multi))
```
Output on my machine:

```
Took 1.96s for multiprocessed version - 12777.25 FPS
Took 59.70s for single process version - 418.80 FPS
Multiprocessed training is 30.51x faster!
```
1.2 SampleFactory Speed
Sample Factory is another RL framework. You can follow the steps below to reproduce the FPS for `CartPole-v1`.
Step 1: Run

```
git clone https://github.com/alex-petrenko/sample-factory.git
cd sample-factory
conda env create -f environment.yml
conda activate sample-factory
```

(Don't use pip here, as pip might not work.)

Step 2: Run

```
python -m sample_factory_examples.train_gym_env --algo=APPO --use_rnn=False --num_envs_per_worker=20 --policy_workers_per_policy=2 --recurrence=1 --with_vtrace=False --batch_size=512 --hidden_size=256 --encoder_type=mlp --encoder_subtype=mlp_mujoco --reward_scale=0.1 --save_every_sec=10 --experiment_summaries_interval=10 --experiment=example_gym_cartpole-v1 --env=gym_CartPole-v1
```
Output on my machine:

```
FPS is (10 sec: 42758.9, 60 sec: 43001.2, 300 sec: 43001.2). Total num frames: 1420800.
Throughput: 0: 47068.9. Samples: 1388960. Policy #0 lag: (min: 6.0, avg: 51.4, max: 105.0)
Avg episode reward: [(0, '483.310')]
```

which means the FPS is 43001.2.
1.3 DI-engine Speed
Running `ding -m serial -c cartpole_a2c_config.py -s 0` gives me something like `avg_envstep_per_sec: 2514.50`, which means the FPS is 2514.50.
All of the above results were generated on the same machine, running the same env, `CartPole-v1`. As we can see, DI-engine is about 17 times slower than SampleFactory. This should be fixed ASAP. Updating `SyncSubprocessEnvManager` along the lines of `SubprocVecEnv` would be a good start. After that, we can see what we can learn from SampleFactory.
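For reference, the speed gaps discussed in this thread follow directly from the three FPS numbers above; a quick sanity check:

```python
# FPS numbers reported above, all measured on the same machine with CartPole-v1.
sample_factory_fps = 43001.2   # SampleFactory (APPO)
sb3_subproc_fps = 12777.25     # stable-baselines3 SubprocVecEnv (A2C)
di_engine_fps = 2514.50        # DI-engine serial pipeline (A2C)

print(f"vs SampleFactory: {sample_factory_fps / di_engine_fps:.1f}x slower")  # ~17.1x
print(f"vs SubprocVecEnv: {sb3_subproc_fps / di_engine_fps:.1f}x slower")     # ~5.1x
```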
Our efficiency experiments on Atari/MuJoCo environments are comparable with stable-baselines3/tianshou, and our mock test benchmark results are shown in this link.
As for your test results in the `cartpole` env, I will dig deeper to find out what causes the FPS differences.
`SampleFactory` is indeed the highest-throughput example in the APPO setting right now. Running the collector (actor) and learner asynchronously is a key point, while our `serial_pipeline` is a serial demo. We plan to add a parallel training demo for Atari/MuJoCo/SMAC in December.
For the env manager (vec env), Python multiprocessing and threading are not a good choice (due to the GIL and Python IPC). I talked about this problem with tianshou's author last week, and we think a new env manager based on a C++ thread pool is necessary; this will be an important new feature in November/December.
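To illustrate the GIL part of that claim, here is a standalone toy benchmark (unrelated to DI-engine's code): a CPU-bound, pure-Python "env step" does not get faster when spread over threads, because only one thread can hold the GIL at a time.

```python
import time
import threading


def cpu_bound_step(n=2_000_000):
    # Stand-in for a pure-Python env step: spins the CPU and holds the GIL.
    s = 0
    for i in range(n):
        s += i
    return s


# One thread.
t0 = time.time()
cpu_bound_step()
single = time.time() - t0

# Four threads "in parallel": the GIL serializes them, so the wall-clock time
# is roughly 4x the single-thread time, not the same. A Python thread pool
# therefore cannot scale an env manager for CPU-bound envs.
threads = [threading.Thread(target=cpu_bound_step) for _ in range(4)]
t0 = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
multi = time.time() - t0

print(f"1 thread: {single:.2f}s, 4 threads: {multi:.2f}s")
```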
Thanks for the information.
> Our efficiency experiments on Atari/MuJoCo environments are comparable with stable-baselines3/tianshou

This is good, but it only tests the collector. We should test the throughput of the whole training pipeline end to end. This could be done sometime next year.
> For the env manager (vec env), Python multiprocessing and threading are not a good choice

Could we please at least match `SubprocVecEnv`'s performance by the end of October? We are still 5 times slower than `SubprocVecEnv`, which is itself Python, not C++.
> our serial_pipeline is a serial demo

Agreed that we need asynchrony, which could be achieved by starting two `multiprocessing.Process` instances. I wish we could also have an MVP (minimum viable product) demo for `slime_volley` that replaces the low-performance `serial_pipeline` by the end of October.
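To make that suggestion concrete, here is a rough sketch of the kind of MVP meant here: a collector process and a learner process decoupled by a bounded queue, so neither side waits for the other. Every function name below is a stand-in invented for illustration; none of this is DI-engine API.

```python
import time
import queue as queue_lib
import multiprocessing as mp


def collect_one_batch():
    # Stand-in for rolling out the envs with the current policy.
    time.sleep(0.01)
    return [0.0] * 64


def update_model(batch):
    # Stand-in for one optimizer step on the learner side.
    time.sleep(0.02)


def collector(data_queue, stop_event):
    # Produce data until the learner signals that it has finished.
    while not stop_event.is_set():
        batch = collect_one_batch()
        try:
            data_queue.put(batch, timeout=1)
        except queue_lib.Full:
            continue
    data_queue.cancel_join_thread()  # don't block process exit on unread items


def learner(data_queue, stop_event, total_iters):
    # Consume batches as they arrive; blocks only if the collector falls behind.
    for _ in range(total_iters):
        update_model(data_queue.get())
    stop_event.set()


if __name__ == '__main__':
    data_queue = mp.Queue(maxsize=8)  # bounded queue keeps the two sides a few steps apart
    stop_event = mp.Event()
    procs = [
        mp.Process(target=collector, args=(data_queue, stop_event)),
        mp.Process(target=learner, args=(data_queue, stop_event, 100)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```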
Hello, this is a serious system design problem, which may require a fundamental refactor of the framework. It needs people familiar with framework and system design to work on it.
Could you let us know the current progress on this? @PaParaZz1
> Hello, this is a serious system design problem, which may require a fundamental refactor of the framework. It needs people familiar with framework and system design to work on it. Could you let us know the current progress on this? @PaParaZz1
We are preparing a new design for the main entry function, and new benchmark results on cartpole will be posted on 11.8.
One open-source framework released last week showed that EnvPool can achieve even a 2x improvement over SampleFactory.
@PaParaZz1 Could you let us know the latest FPS results of DI-engine please 😔
> One open-source framework released last week showed that EnvPool can achieve even a 2x improvement over SampleFactory. @PaParaZz1 Could you let us know the latest FPS results of DI-engine please 😔
`EnvPool` is the C++ env manager I mentioned before, and we are in the middle of an ambitious refactor of the main entry, so we have delayed the detailed asynchronous pipeline benchmark to the end of November. Besides, we will add a new env manager option (DI-engine + EnvPool) this week, but `EnvPool` only supports Atari for now.
@PaParaZz1 Hello there, it has been 3 months. We have to ask if there is any update on this ticket.
Hi @zxzzz0, glad to see you are still following this issue. In fact, we have made some important improvements over the past three months, including process architecture optimization and an objective horizontal comparison. First, please take a look at some of our test results:

All test code, reports and TensorBoard screenshots will be open-sourced on GitHub in the future, so you can try it yourself.
Next, I will try to explain the problems you found and our solutions (which will land in the 1.0 version).
RL efficiency is divided into two parts: environment collection speed and training speed. We care about maximizing the overall speed, because if the environment is collected non-stop (using the entire CPU) but the training speed cannot keep up, it only produces a large amount of data that cannot be used effectively, and vice versa.
Therefore, we mainly solve this problem from two aspects. The first is to make collection and training asynchronous, so that the model does not wait for the environment's data and the environment does not wait for the model to update; the two can maintain a certain step difference. In fact, not only the environment and training: most of the steps within the entire pipeline can be made asynchronous. As far as I know, this is not well supported in most frameworks.
The second point is to reduce the waste of resources caused by synchronization between the parent and child processes, which is what you described: the CPU not being fully used during collection. In fact, if you watch the `top` command, the env manager does not keep the whole machine's CPU busy all the time; there are periodic fluctuations, because a single parent process is used to gather all the data from the child processes. We will solve this bottleneck in v1.0; the upper/lower comparison charts below show the difference in CPU usage between the new execution mode (bottom of the picture) and the previous execution mode (top of the picture).
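As a toy illustration of that parent-process bottleneck (stand-in code, not DI-engine's actual env manager), the pattern below is roughly what a synchronous subprocess env manager does: one parent dispatches actions one by one and then waits for the slowest child at every step, so the children idle whenever the parent is busy and CPU usage fluctuates instead of staying high.

```python
import time
import random
import multiprocessing as mp


def env_worker(conn):
    # Stand-in child process: receive an action, simulate a tiny env step, reply.
    while True:
        action = conn.recv()
        if action is None:
            break
        time.sleep(random.uniform(0.001, 0.003))  # per-step cost varies per child
        conn.send((0.0, 0.0, False, {}))


if __name__ == '__main__':
    n_envs = 4
    parent_conns, procs = [], []
    for _ in range(n_envs):
        parent_conn, child_conn = mp.Pipe()
        p = mp.Process(target=env_worker, args=(child_conn, ))
        p.start()
        parent_conns.append(parent_conn)
        procs.append(p)

    for step in range(100):
        for conn in parent_conns:
            conn.send(step)  # the parent dispatches actions sequentially
        results = [conn.recv() for conn in parent_conns]  # then waits for the slowest child

    for conn in parent_conns:
        conn.send(None)
    for p in procs:
        p.join()
```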

I hope these explanations restore your confidence in DI-engine and help you in your future work. Keep following us, thank you.
Best.