Advice on calling a GPU reward function when using SubprocVectorEnv
- [x] I have marked all applicable categories:
- [ ] exception-raising bug
- [ ] RL algorithm bug
- [ ] documentation request (i.e. "X is missing from the documentation.")
- [ ] new feature request
- [x] I have visited the source website
- [x] I have searched through the issue tracker for duplicates
- [x] I have mentioned version numbers, operating system and environment, where applicable:
import tianshou, torch, numpy, sys
print(tianshou.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
# 0.4.4 1.9.1+cu111 1.20.3 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] linux
Hello, thanks for your project! My task has n (n >= 60) steps, and after those n steps the environment needs to call a reward function implemented in PyTorch on the GPU. The reward function runs on a single GPU. I want to use SubprocVectorEnv to speed up the program. Do you have any advice?
Maybe I should call multiprocessing.set_start_method('spawn') during the initialization of SubprocVectorEnv?
- Split your env step into two parts:
  - the n-step forward
  - the GPU forward
- Create an env that only contains the n-step forward
- Use SubprocVectorEnv to create a venv
- Add a vectorized wrapper to the subprocess venv that contains the GPU forward (see the sketch after this list)
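A minimal sketch of that wiring, using the placeholder names from the detailed reply further down (YourEnv, GPURewardWrapper, reward_function); none of these are tianshou built-ins:

from tianshou.env import SubprocVectorEnv

# YourEnv.step() runs only the n-step forward and returns a dummy reward of 0;
# YourEnv and GPURewardWrapper are the placeholder classes defined in the reply below.
train_envs = SubprocVectorEnv([lambda: YourEnv(arg) for _ in range(training_num)])
# the wrapper runs in the main process and replaces the dummy reward
# with the output of the GPU reward model
train_envs = GPURewardWrapper(train_envs, reward_function)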
Thanks a lot for your quick reply. I'm sorry, I am not very familiar with tianshou yet, so a few things are unclear to me:
1. What's the difference between env and venv? I think a venv is built from several envs, e.g. with a for-loop (Dummy) or multiprocessing (Subproc).
2. So I create an env that only contains the n-step part, and then create a venv from it (something like SubprocVectorEnv([lambda: gym.make(xxx) for _ in range(args.training_num)]))?
3. In train_collector = Collector(policy, ENV, buffer), what is ENV? The env or the venv?
Besides, my reward function is based on torch.nn.Module and is optimized with Adam.
@hedy14 Hi. Trinkle's advice is brilliant and elegant. If you didn't fully understand it, here is a more detailed explanation.
Solution
TL;DR: Create an env whose step method actually does the n steps, then use SubprocVectorEnv([your_env_factory] * training_num) (the factory might be lambda: gym.make(...)) to create a vectorized version of your env, and finally use a VectorEnvWrapper to transform the reward from a dummy value (e.g. 0) into the GPU reward (which may depend on observation and action).
Case 1
If your policy can output the full action sequence at once, without needing an observation at each step.
from typing import List, Optional, Tuple, Union

import numpy as np
import torch

from tianshou.env import BaseVectorEnv, VectorEnvWrapper


class GPURewardWrapper(VectorEnvWrapper):
    """Modeled on tianshou's VectorEnvNormObs, but transforms the reward instead of the observation."""

    def __init__(
        self,
        venv: BaseVectorEnv,
        reward_function: torch.nn.Module,
    ) -> None:
        super().__init__(venv)
        self.model = reward_function
        # nn.Module has no .device attribute, so look it up from the parameters
        self.device = next(reward_function.parameters()).device

    def step(
        self,
        action: np.ndarray,
        id: Optional[Union[int, List[int], np.ndarray]] = None,
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
        obs, _, done, info = self.venv.step(action, id)  # the original (dummy) reward is ignored
        # no grad here; if you want to jointly optimize your reward,
        # a more detailed description is needed.
        # your model needs to support batched input!
        with torch.no_grad():
            rew = self.model(torch.as_tensor(obs, device=self.device)).cpu().numpy()
        return obs, rew, done, info
import gym

from tianshou.data import Collector
from tianshou.env import ShmemVectorEnv


class YourEnv(gym.Env):
    # detail omitted

    def step(self, actions):
        for action in actions:  # take the n inner steps
            inner_obs, inner_info = self.inner_step(action)
        # compute obs, done, info from the inner results
        return obs, 0, done, info  # reward is a dummy value; the wrapper fills in the real one


train_envs = ShmemVectorEnv([lambda: YourEnv(arg) for _ in range(training_num)])
train_envs = GPURewardWrapper(train_envs, reward_function)
train_collector = Collector(policy, train_envs, buffer, exploration_noise=True)
Case 2
If your policy needs an observation at each step.
I suggest implementing a custom policy that simply ignores the "reward" coming from the env (a dummy value, e.g. 0) and instead computes the rewards from observations with the GPU reward function in YourPolicy.learn. Indeed, this approach also works in Case 1, and it makes joint optimization trivial.
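A minimal sketch of that idea, assuming an on-policy algorithm (PPO is used purely as an example), array-valued observations, and a reward that depends only on obs_next; GPURewardPPOPolicy and reward_model are hypothetical names, not tianshou API. It overwrites the dummy reward in process_fn, which runs just before returns and advantages are computed from batch.rew; the same recomputation can instead be done inside learn for algorithms that consume rewards there.

import numpy as np
import torch

from tianshou.data import Batch, ReplayBuffer
from tianshou.policy import PPOPolicy


class GPURewardPPOPolicy(PPOPolicy):
    """PPO variant that ignores the env's dummy reward and recomputes it with a GPU model."""

    def __init__(self, *args, reward_model: torch.nn.Module, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.reward_model = reward_model
        self.reward_device = next(reward_model.parameters()).device

    def process_fn(self, batch: Batch, buffer: ReplayBuffer, indices: np.ndarray) -> Batch:
        # replace the dummy reward before the parent class computes returns/advantages
        with torch.no_grad():
            obs_next = torch.as_tensor(
                batch.obs_next, dtype=torch.float32, device=self.reward_device
            )
            batch.rew = self.reward_model(obs_next).flatten().cpu().numpy()
        return super().process_fn(batch, buffer, indices)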
Reference
I suggest you refer to
https://github.com/thu-ml/tianshou/blob/278c91a2228a46049a29c8fa662a467121680b10/tianshou/env/venv_wrappers.py#L67-L132
but instead of transforming the observation, as VectorEnvNormObs does, you need to transform the reward: from a dummy value (e.g. 0) to the GPU reward, which might depend on observation and action.
For the usage of VectorEnvWrapper, you can refer to https://github.com/thu-ml/tianshou/blob/278c91a2228a46049a29c8fa662a467121680b10/examples/mujoco/mujoco_env.py#L13-L41
It works just like an ordinary gym.Wrapper, but wraps a vector env.
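For intuition only, a toy sketch of writing and applying such a wrapper (PrintStepWrapper is a made-up example, not part of tianshou; only step is overridden, everything else is delegated to the wrapped venv):

import gym

from tianshou.env import DummyVectorEnv, VectorEnvWrapper


class PrintStepWrapper(VectorEnvWrapper):
    """Toy wrapper: logs each vectorized step, delegates everything else."""

    def step(self, action, id=None):
        obs, rew, done, info = self.venv.step(action, id)
        print(f"stepped {len(obs)} envs, mean reward {rew.mean():.3f}")
        return obs, rew, done, info


venv = DummyVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
venv = PrintStepWrapper(venv)  # use it anywhere a venv is expected, e.g. in a Collector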