rl [BugFix] Target return for sequential td

Open BY571 opened this issue 1 year ago • 16 comments

Description

The target return calculation did not work properly for sequential tensordicts that have a time dimension.

Motivation and Context

When using a sequential tensordict with a time dimension (batch, time, feature), the reward was subtracted across the whole time dimension, which is wrong. The reward should be subtracted only from the most recent target return in time.
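
To make the intended behavior concrete, here is a minimal sketch of the difference. It is not the code changed in this PR: nested Python lists stand in for tensors of shape (batch, time, feature), and the function names `buggy_target_return` and `next_target_return` are hypothetical, chosen only for illustration.

```python
def buggy_target_return(target_return, reward):
    """Subtract the reward at every time step (the behavior described as wrong).

    Both arguments are nested lists shaped (batch, time, feature).
    """
    return [
        [[g - r for g, r in zip(g_step, r_step)]
         for g_step, r_step in zip(g_seq, r_seq)]
        for g_seq, r_seq in zip(target_return, reward)
    ]


def next_target_return(target_return, reward):
    """Subtract only the latest reward from the latest target return.

    Returns a nested list shaped (batch, feature): one new target return
    per batch element, computed from the last time step only.
    """
    return [
        [g - r for g, r in zip(g_seq[-1], r_seq[-1])]
        for g_seq, r_seq in zip(target_return, reward)
    ]


# One batch element, three time steps, one feature.
target_return = [[[10.0], [9.0], [7.0]]]
reward = [[[1.0], [2.0], [3.0]]]

print(buggy_target_return(target_return, reward))  # every time step is modified
print(next_target_return(target_return, reward))   # only the last step is used
```

In the sketch, only the final time step contributes to the new target return; earlier entries in the sequence are left untouched.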

[ ] I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Remove all that do not apply:

[X] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds core functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
[ ] Documentation (update in the documentation)
[ ] Example (update in the folder of examples)

Checklist

Go over all the following points, and put an x in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!

[X] I have read the CONTRIBUTION guide (required)
[ ] My change requires a change to the documentation.
[ ] I have updated the tests accordingly (required for a bug fix or a new feature).
[ ] I have updated the documentation accordingly.

Apr 24 '23 07:04 BY571
