[BUG] ClipPPOLoss encounters a bug when calculating the loss in a composite action space scenario, failing to produce the correct result.
Describe the bug
Following up on https://github.com/pytorch/rl/issues/2402, I have already applied the modifications suggested in that issue, but my code still runs into problems when computing the PPO loss. After debugging, I found that the line gain1 = log_weight.exp() * advantage always produces a tensor of all zeros. I also noticed that this may be because the result of return self.log_prob_composite(sample, include_sum=True) is extremely small (e.g., -300, -200, etc.). I can't figure out why self.log_prob_composite computes such values, and I hope someone can help me with this issue.
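For reference, here is a minimal sketch (with made-up values in the range I observed, assuming the log-weight itself ends up this negative) of why such small values make the gain collapse: exp() of anything around -200 or -300 underflows to zero in float32, so the product with the advantage is zero regardless of the advantage.

import torch

# Assumed values: log-weights in the range I observed (-300, -200).
log_weight = torch.tensor([-300.0, -200.0])
advantage = torch.tensor([1.5, -0.7])

weight = log_weight.exp()         # exp(-300) ~ 5e-131 underflows float32 to 0.0
print(weight)                     # tensor([0., 0.])
print(weight * advantage)         # tensor([0., -0.]) -> gain1 is all zeros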
System info
Windows 11 24H2, Python 3.10.14
torch 2.4.0+cu118, torchaudio 2.4.0+cu118, torchrl 0.5.0+ca3a595, torchvision 0.19.0+cu118, tensordict 0.5.0+eba0769
It seems that I have identified the issue: when a ClipPPOLoss instance runs its forward pass on a batch, there appears to be a problem in the log_prob function of CompositeDistribution. The tensor dimensions passed to dist.log_prob seem to be incorrect, which produces abnormal values for lp. When I changed lp = dist.log_prob(sample.get(name)) to lp = dist.log_prob(sample.get(name).squeeze(-1)), the training loss curve began to respond to changes in the learning rate. Without this change, altering the learning rate has no effect on the loss curve during training, even with the random seed unchanged; in other words, ten training runs with different learning rates produce identical loss curves.
Here is the complete log_prob function code after my modifications.
def log_prob(
    self, sample: TensorDictBase, *, aggregate_probabilities: bool | None = None
) -> torch.Tensor | TensorDictBase:  # noqa: D417
    if aggregate_probabilities is None:
        aggregate_probabilities = self.aggregate_probabilities
    if not aggregate_probabilities:
        return self.log_prob_composite(sample, include_sum=True)
    slp = 0.0
    for name, dist in self.dists.items():
        # Workaround: drop the trailing singleton dimension of the stored
        # action before evaluating log_prob, to avoid unintended broadcasting.
        if sample.get(name).ndim > 1:
            lp = dist.log_prob(sample.get(name).squeeze(-1))
        else:
            lp = dist.log_prob(sample.get(name))
        # Sum out any event dimensions beyond the sample's batch dimensions.
        if lp.ndim > sample.ndim:
            lp = lp.flatten(sample.ndim, -1).sum(-1)
        slp = slp + lp
    return slp
Interesting, so with your change it learns fine?
I think what could be happening is that your sample tensors have more dimensions than the distribution parameters, and then some broadcasting is done by PyTorch.
Are you able to change the action_spec in your code so that your actions no longer have the dimension you are removing with squeeze(-1)? It would be interesting to see if that solves the problem. I remember that in the code I provided in https://github.com/pytorch/rl/issues/2402, I had to modify the structure of the action space a bit.
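To illustrate what I mean by broadcasting (a minimal sketch with assumed shapes, not your actual model): if the distribution parameters have batch shape [B] but the stored action keeps a trailing singleton dimension [B, 1], log_prob silently broadcasts to [B, B] instead of returning [B].

import torch
from torch.distributions import Normal

batch = 4
dist = Normal(torch.zeros(batch), torch.ones(batch))  # parameters of shape [B]

action = torch.randn(batch, 1)                   # action stored with a trailing singleton dim
print(dist.log_prob(action).shape)               # torch.Size([4, 4]) -- broadcast, wrong values downstream
print(dist.log_prob(action.squeeze(-1)).shape)   # torch.Size([4])    -- what the loss expects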
I am not sure whether these modifications will let the model train and converge properly, because I am a beginner, and also because of the warning mentioned in issue 2458 (https://github.com/pytorch/rl/issues/2458). However, I am confident that it performs better, or at least more normally, than before modifying the log_prob function. Additionally, my code has indeed been modified according to the suggestions you provided in issue 2402 (https://github.com/pytorch/rl/issues/2402) to redefine self.action_spec in the _make_spec function.
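For reference, the kind of shape change being discussed would look roughly like this (a hypothetical sketch with made-up sub-action keys, not my actual _make_spec):

from torchrl.data import CompositeSpec, DiscreteTensorSpec

# Before: each sub-action keeps a trailing singleton dim, so sampled actions
# have shape [..., 1] and need squeeze(-1) before log_prob.
action_spec_before = CompositeSpec(
    action1=DiscreteTensorSpec(3, shape=(1,)),
    action2=DiscreteTensorSpec(5, shape=(1,)),
)

# After: scalar-shaped sub-actions that match the distributions' event shape,
# so no squeeze is needed when evaluating log_prob.
action_spec_after = CompositeSpec(
    action1=DiscreteTensorSpec(3),
    action2=DiscreteTensorSpec(5),
)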