[WIP] Hindsight Experience Replay Transform
Description
Adds Hindsight Experience Replay (HER) Transform
Motivation and Context
This is a first draft of the HER transform. However, I am not sure whether it should be a Transform, or whether we should create a separate Augmentation
class, since we are not transforming a single element of the tensordict but augmenting the existing collected data. Such a class could also be interesting for future "data augmentation strategies", which I think we do not have so far.
Types of changes
What types of changes does your code introduce? Remove all that do not apply:
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds core functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation (update in the documentation)
- [ ] Example (update in the folder of examples)
Checklist
Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!
- [ ] I have read the CONTRIBUTION guide (required)
- [ ] My change requires a change to the documentation.
- [ ] I have updated the tests accordingly (required for a bug fix or a new feature).
- [ ] I have updated the documentation accordingly.
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/1819
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:x: 3 New Failures, 19 Unrelated Failures
As of commit 90eef759458534b18f26d18f6712e41d548b9280 with merge base 57139bd994bdac62b76a707cb1cf6e7daf7016fd:
NEW FAILURES - The following jobs have failed:
- Wheels / build-wheel-mac (3.10, 3.10.3) (gh)
  ##[error]The operation was canceled.
- Wheels / build-wheel-mac (3.8, 3.8) (gh)
  ##[error]The operation was canceled.
- Wheels / build-wheel-mac (3.9, 3.9) (gh)
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
- Continuous Benchmark (PR) / CPU Pytest benchmark (gh)
  Workflow failed! Resource not accessible by integration
- Continuous Benchmark (PR) / GPU Pytest benchmark (gh)
  Workflow failed! Resource not accessible by integration
- Examples Tests on Linux / tests (3.9, 12.1) / linux-job (gh)
  RuntimeError: Command docker exec -t 184df0bbc8416d4f6dffaaa35d1ffa8c1c3da392c7de351de1865a4a86e8a267 /exec failed with exit code 1
- Habitat Tests on Linux / tests (3.9, 11.6) / linux-job (gh)
  RuntimeError: Command docker exec -t 3258c420522b3517fb8db3ff7719123fd31be6c04aff78f84940c3720934e876 /exec failed with exit code 139
- Lint / python-source-and-configs / linux-job (gh)
  RuntimeError: Command docker exec -t 674e3f0b4d94994149bc3c569aff7186428763cc7b6d437e2b4a4aadf9cab60b /exec failed with exit code 1
- Unit-tests on Linux / tests-cpu (3.10) / linux-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on Linux / tests-cpu (3.11) / linux-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on Linux / tests-cpu (3.8) / linux-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on Linux / tests-cpu (3.9) / linux-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on Linux / tests-gpu (3.8, 12.1) / linux-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on Linux / tests-olddeps (3.8, 11.6) / linux-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on Linux / tests-optdeps (3.9, 12.1) / linux-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on Linux / tests-stable-gpu (3.8, 11.8) / linux-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on MacOS CPU / tests (3.11) / macos-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on MacOS CPU / tests (3.8) / macos-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on Windows / unittests-cpu / windows-job (gh)
  test/test_trainer.py::TestRB::test_rb_trainer_save[True-torch-list-True]
- Unit-tests on Windows / unittests-gpu / windows-job (gh)
  ##[error]The operation was canceled.
- Wheels / build-wheel-mac (3.11, 3.11) (gh)
  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/torch/include/ATen/core/function_schema.h:603:46: error: 'value' is unavailable: introduced in macOS 10.13
BROKEN TRUNK - The following job failed but was already present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
- Libs Tests on Linux / unittests-sklearn (3.9, 12.1) / linux-job (gh)
  test/test_libs.py::TestOpenML::test_data[mushroom_onehot]
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@ahmed-touati suggested we use a sampler for this rather than a transform. I'm not strongly opinionated on the matter, mostly because I need more context on what we're trying to achieve here. Can you elaborate a bit more on what this transform does, maybe with a bunch of examples?
HER is mainly used in goal-conditioned RL with sparse reward signals, where the agent has to reach/achieve a goal state and only gets a reward (+1) when the goal state is achieved, and no reward otherwise. The observation consists of three elements: the observation the agent sees, the state the agent achieved (e.g. an x, y, z position), and the goal state the agent should reach (x, y, z). A typical task would be a robot that has to reach a goal position. The observation also includes the agent's position; it is mostly additional information, but it helps for understanding here.
Because the reward function is sparse, most trajectories carry no learning signal: the agent is unlikely to reach the goal position randomly or by pure luck. What HER does is, for each step of a trajectory that you want to add to the buffer, sample a new goal state and pretend that this was the actual goal the agent had to reach.
So let's say you have a real transition (obs, action, reward, done, next obs, achieved_position, goal_position). For this tuple you now sample a new goal_position and recompute the reward based on this new goal_position and the real achieved_position. You then add the real transition (obs, action, reward, done, next obs, achieved_position, goal_position) but also the HER-augmented transition (obs, action, new_reward, done, new next obs, achieved_position, new_goal_position). The sampling can happen in different ways, but that is not important for now. What I think will be important is that we need the reward function, and I'm not sure if we can pass it to the writer/sampler for the buffer; that's why my first thought was a transform. Most of the time the reward function might just be a Euclidean distance, but for other tasks the user may need to provide a more sophisticated reward function.
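To make the relabelling step concrete, here is a minimal sketch in plain PyTorch. All names (`compute_reward`, `relabel_transition`, the `achieved_position`/`goal_position` keys) are hypothetical and not part of TorchRL; it only illustrates the augmentation described above, with a simple distance-threshold reward standing in for the user-provided reward function.

```python
# Minimal HER relabelling sketch (hypothetical names, not TorchRL API).
import torch


def compute_reward(achieved: torch.Tensor, goal: torch.Tensor, eps: float = 0.05) -> torch.Tensor:
    # Sparse reward: +1 if the achieved position is within eps of the goal, else 0.
    return (torch.linalg.norm(achieved - goal, dim=-1) < eps).float()


def relabel_transition(transition: dict, new_goal: torch.Tensor) -> dict:
    # Copy the transition, swap in the new goal, and recompute the reward
    # against the substituted goal.
    relabelled = dict(transition)
    relabelled["goal_position"] = new_goal
    relabelled["reward"] = compute_reward(transition["achieved_position"], new_goal)
    return relabelled


# Original transition: sparse reward of 0 because the real goal was missed ...
t = {
    "obs": torch.randn(4),
    "action": torch.randn(2),
    "achieved_position": torch.tensor([0.1, 0.2, 0.0]),
    "goal_position": torch.tensor([1.0, 1.0, 1.0]),
    "reward": torch.tensor(0.0),
    "done": torch.tensor(False),
}
# ... relabelled with the achieved position as the new goal, yielding reward 1.
t_her = relabel_transition(t, new_goal=t["achieved_position"])
```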
What I think will be important is that we need the reward function, and I'm not sure if we can pass it to the writer/sampler for the buffer; that's why my first thought was a transform.
Why not? I would guess that even if it's a complex nn.Module you can still do pretty much everything with a well-tailored function (at least nothing less than with a transform).
Thanks for the context btw!
Why not? I would guess that even if it's a complex nn.Module you can still do pretty much everything with a well-tailored function (at least nothing less than with a transform).
Revisiting this, I think it would make much more sense to do it with a writer. We want to augment the incoming data with newly sampled goal states and store them all together in the buffer. I think this would generally be a good way to add other data augmentation strategies as well, with writers instead of transforms. I'm having a closer look at the writer classes right now and will update the code here.
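As a rough sketch of that direction (hypothetical, not the actual TorchRL Writer API; the wrapped writer is only assumed to expose `add`/`extend`), an augmenting writer could relabel each incoming transition a few times before storing it, with the reward function supplied via the relabelling callable:

```python
# Hypothetical writer-based HER augmentation: relabel incoming transitions
# before delegating storage to an underlying writer-like object.
from typing import Callable, List


class HERAugmentingWriter:
    def __init__(self, base_writer, relabel_fn: Callable, n_goals: int = 4):
        self.base_writer = base_writer  # assumed to expose add()/extend()
        self.relabel_fn = relabel_fn    # produces a relabelled copy of a transition
        self.n_goals = n_goals          # augmented copies stored per real transition

    def add(self, transition) -> None:
        # Store the real transition, then n_goals relabelled copies of it.
        self.base_writer.add(transition)
        for _ in range(self.n_goals):
            self.base_writer.add(self.relabel_fn(transition))

    def extend(self, transitions: List) -> None:
        for transition in transitions:
            self.add(transition)
```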
But this would not allow us to stack multiple augmentations on top of each other... so maybe not that ideal for augmentations
You could still transform your data before passing it to the writer, but not after
Not sure about that one. We've had that request three times already, but for three different purposes, so if there's a way to make it a modular component of the lib, I'd like to consider that over a script that is harder to reuse (and more error-prone on the user side).