Antonin RAFFIN
> Would it be implemented to allow loading models trained with an old version of SB3?

Yes, but probably in a separate PR.

> The argument will still exist...
> Because the timeout is handled at sampling time for the classic replay buffer (normally).

Yes, but we can probably do the same for online sampling.
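For reference, a minimal sketch of what "handling the timeout at sampling time" can look like (standalone placeholder arrays here, modeled on how the standard replay buffer masks timeouts, not the exact code of this PR):

```python
import numpy as np

# Placeholder buffer arrays: `dones` marks episode ends, `timeouts` marks
# ends caused only by the time limit (truncation), not a real termination.
buffer_size, batch_size = 1000, 4
dones = np.zeros(buffer_size, dtype=np.float32)
timeouts = np.zeros(buffer_size, dtype=np.float32)

batch_inds = np.random.randint(0, buffer_size, size=batch_size)
# Mask out timeouts at sampling time: a timeout should not be treated as a
# true terminal state when bootstrapping the value of the next state.
effective_dones = dones[batch_inds] * (1.0 - timeouts[batch_inds])
```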
> I have one major concern though: the code is too slow!

That's because you removed the max episode length, right? Btw, was that needed? Now the implementation seems really...
> (episode_idx, trans_idx, env_idx), then you would have to manage a list of indexes of transitions and episodes for each environment.

It seems pretty complicated. I didn't even try. Actually,...
> Let me check the current code; we might still reference the more consistent implementation for people interested, but I'm not sure we will keep it, mainly for the two...
Looking at the test, the +1 for the future strategy does actually make a big difference? (the performance test was failing before, even with almost twice the budget)
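For context, a toy sketch of why an off-by-one in the "future" goal index window can matter (purely illustrative arithmetic, not the code from this PR): shifting the window by one step decides whether the achieved goal reached right after the current transition can be drawn, and that is the only draw that is guaranteed to relabel the transition as a success.

```python
import numpy as np

rng = np.random.default_rng(0)
ep_length = 10   # hypothetical episode length (achieved goals indexed 0..ep_length)
trans_idx = 4    # transition being relabeled (goes from state 4 to state 5)

# Inclusive window: the first eligible index is trans_idx + 1, i.e. the achieved
# goal right after this transition; drawing it makes the relabeled transition a
# guaranteed success, so the reward signal is much denser.
goal_idx_inclusive = rng.integers(trans_idx + 1, ep_length + 1)

# Window shifted by one: the transition's own outcome can never be drawn, so the
# relabeled transition may still look like a failure.
goal_idx_shifted = rng.integers(trans_idx + 2, ep_length + 1)
```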
> Btw, there is no reason that online sampling gives better results.

There is. In fact, in my experience, the two are not equivalent (and the online sampling usually but...
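To make the distinction concrete, a rough sketch of the two relabeling schemes (simplified dict-based transitions, not the actual buffer code): offline sampling creates a fixed set of virtual transitions when the episode is stored, while online sampling re-draws goals every time a batch is sampled, so the same stored transition can be paired with different goals over training.

```python
import random

def relabel(transition, new_goal, compute_reward):
    """Copy of the transition with a substituted goal and recomputed reward."""
    t = dict(transition)
    t["desired_goal"] = new_goal
    # Mirrors the goal-env signature compute_reward(achieved_goal, desired_goal, info)
    t["reward"] = compute_reward(t["achieved_goal"], new_goal, {})
    return t

# Offline ("at storage time"): virtual transitions are created once and frozen.
def store_episode_offline(buffer, episode, n_sampled_goal, compute_reward):
    for i, transition in enumerate(episode):
        buffer.append(transition)
        future = episode[i + 1:]
        for _ in range(n_sampled_goal):
            if future:
                goal = random.choice(future)["achieved_goal"]
                buffer.append(relabel(transition, goal, compute_reward))

# Online ("at sampling time"): goals are re-drawn for every batch.
def sample_online(episodes, batch_size, her_ratio, compute_reward):
    batch = []
    for _ in range(batch_size):
        episode = random.choice(episodes)
        i = random.randrange(len(episode))
        transition = episode[i]
        if random.random() < her_ratio and i + 1 < len(episode):
            goal = random.choice(episode[i + 1:])["achieved_goal"]
            transition = relabel(transition, goal, compute_reward)
        batch.append(transition)
    return batch
```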
> But, as @araffin suspected, it does not work when it is a SubprocVecEnv..

Yes, because the `compute_reward` function is in another process, not directly accessible from the main one....
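As a side note, one way around that (a sketch, where "MyGoalEnv-v0" is a placeholder for any goal-conditioned env implementing `compute_reward(achieved_goal, desired_goal, info)`) is to go through the VecEnv `env_method` call, which forwards the call to the worker process instead of touching the env object directly:

```python
import numpy as np
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env():
    import gym
    # Placeholder goal-conditioned env, replace with a real registered env.
    return gym.make("MyGoalEnv-v0")

if __name__ == "__main__":
    vec_env = SubprocVecEnv([make_env for _ in range(2)])

    achieved = np.zeros(3)
    desired = np.ones(3)

    # Direct attribute access fails: the env lives in a worker process, so
    # there is no local object exposing compute_reward.
    # reward = vec_env.envs[0].compute_reward(achieved, desired, {})  # AttributeError

    # env_method forwards the call to the worker(s) and returns one result per env.
    rewards = vec_env.env_method("compute_reward", achieved, desired, {}, indices=[0])
    print(rewards[0])
```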
> Maybe it is just a statistical effect.

Most probably, yes. I would also do some testing with harder envs (cf. the RL Zoo with highway-env and then other...
Hello, sorry for the late reply (I'm on holidays, I will try to write a longer answer next week). I know that our current architecture is not really flexible (that...