ElegantRL
self.states[0]
@Yonv1943 self.states[0] is very difficult to understand. Please suggest a better approach. https://github.com/AI4Finance-Foundation/ElegantRL/blob/98b83959cc6e62f17e7c2ad104c5050e84bd7297/helloworld/helloworld_PPO_single_file.py#L194
In stable-baselines3, they set self._last_obs in their training pipeline classes OffPolicyAlgorithm and OnPolicyAlgorithm, as shown below:
```python
self._last_obs = new_obs
# Save the unnormalized observation
if self._vec_normalize_env is not None:
    self._last_original_obs = new_obs_
```
https://github.com/DLR-RM/stable-baselines3/blob/18b29a68e8d5a11d0e98aeea539c247f0a913019/stable_baselines3/common/off_policy_algorithm.py#L517-L520
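For context, an assignment like this sits inside a rollout-collection loop, where the carried-over observation is also the policy input at each step. Below is a minimal sketch of that loop, assuming a classic gym-style single env; it is a simplification for illustration, not stable-baselines3's actual code:

```python
def collect_rollouts_sketch(env, policy, buffer, last_obs, num_steps):
    """Simplified sketch: last_obs is the observation carried over from the
    previous call; it is fed to the policy and updated after every step."""
    for _ in range(num_steps):
        action = policy(last_obs)
        new_obs, reward, done, info = env.step(action)
        buffer.append((last_obs, action, reward, done))
        # corresponds to the "self._last_obs = new_obs" idea above;
        # a single (non-vectorized) env also needs a reset when the episode ends
        last_obs = env.reset() if done else new_obs
    return last_obs  # the caller stores this back (e.g. as self._last_obs)
```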
The training pipeline class needs to create a variable in __init__. This variable, last_obs, records the observation at which the previous, still-unfinished trajectory stopped being explored, so that it is available for training.
With last_obs, the training pipeline class can continue exploring from the position in the previous trajectory that has not yet ended. This avoids calling reset every time, which would let the agent explore only the first part of each trajectory.
The Tianshou library handles last_obs in a somewhat more complex way. In the __init__ of class Collector, they:
- call self.reset() https://github.com/thu-ml/tianshou/blob/278c91a2228a46049a29c8fa662a467121680b10/tianshou/data/collector.py#L77
- call self.reset_env https://github.com/thu-ml/tianshou/blob/278c91a2228a46049a29c8fa662a467121680b10/tianshou/data/collector.py#L120
- and then also assign self.data.obs https://github.com/thu-ml/tianshou/blob/278c91a2228a46049a29c8fa662a467121680b10/tianshou/data/collector.py#L154

The assignment happens at https://github.com/thu-ml/tianshou/blob/278c91a2228a46049a29c8fa662a467121680b10/tianshou/data/collector.py#L136-L154:
```python
if returns_info:
    obs, info = rval
    if self.preprocess_fn:
        processed_data = self.preprocess_fn(
            obs=obs, info=info, env_id=np.arange(self.env_num)
        )
        obs = processed_data.get("obs", obs)
        info = processed_data.get("info", info)
    self.data.info = info
else:
    obs = rval
    if self.preprocess_fn:
        obs = self.preprocess_fn(
            obs=obs, env_id=np.arange(self.env_num)
        ).get("obs", obs)
self.data.obs = obs
```
So, for each DRL library:
- ElegantRL: AgentBase.py --> self.states
- StableBaselines3: OffPolicyAlgorithm or OnPolicyAlgorithm --> self._last_obs
- TianShou: BaseTrainer --> Collector --> self.reset() --> self.reset_env() --> self.data.obs
- The variable name self._last_obs is much clearer than self.states[0].
- last_obs is part of the training process. It is neither reasonable nor clear to put this variable in class AgentBase (see the sketch after this list).
- Continuing to explore from the last observation is good functionality.
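To make the design point concrete, here is a hypothetical before/after sketch; it illustrates the bullets above and is not the actual content of the pull request below:

```python
# Before: the exploration state is hidden inside the agent object.
class AgentBase:
    def __init__(self, env):
        self.states = [env.reset()]   # self.states[0] is the last observation

# After: the training pipeline owns the exploration state, under a clear name.
class Trainer:
    def __init__(self, env, agent):
        self.env = env
        self.agent = agent
        self.last_obs = env.reset()   # the observation to resume exploring from
```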
The following pull request fixes this bug: "Fix bug for vec env and agentbase init" https://github.com/AI4Finance-Foundation/ElegantRL/pull/248