
Implement advice to call GPU reward function using SubprocVectorEnv

Open hedy14 opened this issue 2 years ago • 3 comments

  • [x] I have marked all applicable categories:
    • [ ] exception-raising bug
    • [ ] RL algorithm bug
    • [ ] documentation request (i.e. "X is missing from the documentation.")
    • [ ] new feature request
  • [x] I have visited the source website
  • [x] I have searched through the issue tracker for duplicates
  • [x] I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, torch, numpy, sys
    print(tianshou.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
0.4.4 1.9.1+cu111 1.20.3 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] linux
    

Hello, thanks for your project! My task has n (n >= 60) steps, and after n steps the environment needs to call a reward function implemented in PyTorch on the GPU. The reward function runs on a single GPU. I want to use SubprocVectorEnv to speed up the program. Do you have any advice?

Maybe I should call multiprocessing.set_start_method('spawn') during the initialization of SubprocVectorEnv?
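
Something like this is what I have in mind (just an untested sketch; MyTaskEnv and training_num are placeholders for my own env and the number of workers):

import multiprocessing

from tianshou.env import SubprocVectorEnv

# 'spawn' must be set once, before any worker process is created; with the
# default 'fork' start method on Linux, CUDA cannot be re-initialized
# inside the forked subprocess.
multiprocessing.set_start_method('spawn')

train_envs = SubprocVectorEnv([lambda: MyTaskEnv() for _ in range(training_num)])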

hedy14 avatar Aug 30 '22 12:08 hedy14

  1. Split your env step into two parts:
    • the n-step forward
    • the GPU forward
  2. Create an env that contains only the n-step forward
  3. Use SubprocVectorEnv to create a venv
  4. Add a vectorized wrapper around the subprocess venv that contains the GPU forward

Trinkle23897 avatar Aug 31 '22 03:08 Trinkle23897

Thanks a lot for your quick reply. Sorry, I am not very familiar with tianshou yet, so there are a few things I don't understand:

  1. What's the difference between env and venv? My understanding is that a venv is built from several envs, either with a for-loop (Dummy) or with multiprocessing (Subproc).
  2. So I create an env that contains only the n-step part, and then create a venv from it, something like SubprocVectorEnv([lambda: gym.make(xxx) for _ in range(args.training_num)])?
  3. In train_collector = Collector(policy, ENV, buffer), what is ENV: the env or the venv?

Besides, my reward function is a torch.nn.Module optimized with Adam.

hedy14 avatar Aug 31 '22 03:08 hedy14

@hedy14 Hi. Trinkle's advice is brilliant and elegant. If you didn't fully understand it, I can explain it in more detail.

Solution

TL;DR: Create an env whose step method actually performs the n steps, then use SubprocVectorEnv([your_env_factory] * training_num) (where the factory might be lambda: gym.make(...)) to create a vectorized version of your env, and finally use a VectorEnvWrapper to transform the reward from a dummy value (e.g. 0) into the GPU reward (which may depend on observation and action).

Case 1: Your policy can output the full action sequence at once, without needing an observation at each step.

from typing import List, Optional, Tuple, Union

import gym
import numpy as np
import torch

from tianshou.data import Collector
from tianshou.env import BaseVectorEnv, ShmemVectorEnv
from tianshou.env.venv_wrappers import VectorEnvWrapper


class VectorEnvGPUReward(VectorEnvWrapper):
    """Replace the dummy env reward with one computed by a GPU reward model."""

    def __init__(
        self,
        venv: BaseVectorEnv,
        reward_function: torch.nn.Module,
    ) -> None:
        super().__init__(venv)
        self.model = reward_function

    def step(
        self,
        action: np.ndarray,
        id: Optional[Union[int, List[int], np.ndarray]] = None,
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
        obs, _, done, info = self.venv.step(action, id)  # the original (dummy) reward is ignored
        # No grad here; if you want to jointly optimize your reward model,
        # a more detailed description of your setup is needed.
        # The model must support batched input (one row per sub-env).
        device = next(self.model.parameters()).device
        with torch.inference_mode():
            rew = self.model(torch.from_numpy(obs).to(device)).cpu().numpy()
        return obs, rew, done, info


class YourEnv(gym.Env):
    # details omitted
    def step(self, actions):
        for action in actions:  # take the n inner steps
            inner_obs, inner_info = self.inner_step(action)
        # compute obs, done, info from the inner results
        return obs, 0.0, done, info  # dummy reward; the wrapper fills in the real one


# ShmemVectorEnv is the shared-memory variant of SubprocVectorEnv.
train_envs = ShmemVectorEnv([lambda: YourEnv(arg) for _ in range(training_num)])
train_envs = VectorEnvGPUReward(train_envs, reward_function)
train_collector = Collector(policy, train_envs, buffer, exploration_noise=True)

Case 2: Your policy needs an observation at each step. In this case I suggest implementing a custom policy that simply ignores the "reward" coming from the env (a dummy value, e.g. 0) and computes the rewards from the observations with the GPU reward function in YourPolicy.learn. Indeed, this method is also applicable to Case 1, and it makes joint optimization trivial.
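
For concreteness, a minimal sketch of that idea, assuming a PPO-style policy; the class name GPURewardPPOPolicy and the reward_model argument are placeholders, and the reward model is assumed to map a batch of flat observation arrays to one scalar reward per step. The reward is filled in inside process_fn, which BasePolicy.update runs right before learn, so the computed returns already use the GPU reward; if you also want to train the reward model, recompute the reward with gradients enabled inside learn and step its Adam optimizer there.

import numpy as np
import torch

from tianshou.data import Batch, ReplayBuffer
from tianshou.policy import PPOPolicy


class GPURewardPPOPolicy(PPOPolicy):  # hypothetical name, any algorithm works similarly
    def __init__(self, *args, reward_model: torch.nn.Module, **kwargs):
        super().__init__(*args, **kwargs)
        self.reward_model = reward_model

    def process_fn(self, batch: Batch, buffer: ReplayBuffer, indices: np.ndarray) -> Batch:
        # The env returned a dummy reward; overwrite it with the GPU reward
        # before PPO computes returns/advantages from batch.rew.
        device = next(self.reward_model.parameters()).device
        with torch.no_grad():  # drop no_grad if you jointly optimize the reward model
            obs = torch.as_tensor(batch.obs, dtype=torch.float32, device=device)
            batch.rew = self.reward_model(obs).squeeze(-1).cpu().numpy()
        return super().process_fn(batch, buffer, indices)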

Reference

I suggest you refer to https://github.com/thu-ml/tianshou/blob/278c91a2228a46049a29c8fa662a467121680b10/tianshou/env/venv_wrappers.py#L67-L132. But instead of transforming the observation as VectorEnvNormObs does, you need to transform the reward: from a dummy value (e.g. 0) to the GPU reward, which might depend on observation and action. For the usage of VectorEnvWrapper, you can refer to https://github.com/thu-ml/tianshou/blob/278c91a2228a46049a29c8fa662a467121680b10/examples/mujoco/mujoco_env.py#L13-L41. It works just like an ordinary gym.Wrapper, but wraps a vector env.

YouJiacheng avatar Sep 08 '22 16:09 YouJiacheng