
[RLlib] NaN value error when computing policy gradients with PPO and a large state/action space

AJSVB opened this issue on Sep 04 '22

What happened + What you expected to happen

I assume the error is related to the action space being large, because I cannot reproduce it when the action space is much smaller (i.e., 10 times fewer dimensions). I could reproduce the error with two steps per episode (self.max) and more. Obviously, I checked that the observations do not contain NaNs.

File "/home/ascardigli/.local/lib/python3.9/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1095, in _worker
    self.loss(model, self.dist_class, sample_batch)
  File "/home/ascardigli/.local/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 87, in loss
    curr_action_dist = dist_class(logits, model)
  File "/home/ascardigli/.local/lib/python3.9/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 239, in __init__
    self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
  File "/home/ascardigli/.local/lib/python3.9/site-packages/torch/distributions/normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/ascardigli/.local/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (40, 921600)) of distribution Normal(loc: torch.Size([40, 921600]), scale: torch.Size([40, 921600])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], grad_fn=<SplitBackward0>)

In tower 0 on device cpu
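
To narrow down whether the NaNs come from the data, from already-diverged weights, or from the forward pass itself, a check along these lines can be used (a minimal sketch against the Ray 2.x Algorithm/Policy API; locate_nans is an illustrative helper, not part of RLlib):

import numpy as np

def locate_nans(algo, obs):
    # The observations themselves never contain NaNs:
    assert not np.isnan(obs).any(), "NaN already present in the observation"

    # Have the policy weights already diverged?
    weights = algo.get_policy().get_weights()
    bad = [
        k for k, w in weights.items()
        if np.issubdtype(np.asarray(w).dtype, np.floating) and np.isnan(w).any()
    ]
    print("NaN weight tensors:", bad or "none")

    # Does a single forward pass produce NaN actions?
    action = algo.compute_single_action(obs, explore=False)
    print("NaN in action:", bool(np.isnan(np.asarray(action)).any()))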

Versions / Dependencies

Python 3.9.4, torchvision 0.12.0+cu115, torch 1.11.0+cu115, ray 2.0.0, gym 0.23.1, numpy 1.23.2

Ubuntu 20.04.4 LTS

Reproduction script

The error persists even when the policy is set to 0 and the observations are set to torch.zeros, which simplifies the code significantly.

import gym
import numpy as np
import torch
import torchvision
from gym import spaces

import ray
import ray.rllib.algorithms.ppo as ppo


class PhysicSimulation:
    def __init__(self):
        self.HEIGHT = 720
        self.WIDTH = 1280
        self.max = 2  # steps per episode
        # Dummy observations: the error occurs even with all-zero data.
        self.observations = torch.zeros((self.HEIGHT * self.WIDTH, 3))
        self.reset()

    def reset(self):
        self.indexes = torch.zeros([self.HEIGHT, self.WIDTH], dtype=torch.int)
        self.indexes = self.indexes.view(-1, *self.indexes.shape[2:])
        self.count = 0

    def simulate(self, x):
        # Increment the counter of every pixel whose action value is above the median.
        x = x.flatten()
        threshold = np.quantile(x, .5)
        idx = np.where(x >= threshold)[0]
        self.indexes[idx] = self.indexes[idx] + 1
        self.count += 1

    def out(self, data):
        return data.view(self.HEIGHT, self.WIDTH, *data.shape[1:]).numpy()

    def render(self):
        return self.out(self.observations)

    def observe(self):
        # (HEIGHT, WIDTH, 4): 3 observation channels + 1 normalized counter channel.
        return np.concatenate(
            (self.render(), self.out((self.indexes / self.max).unsqueeze(-1))),
            axis=-1,
        )


class Spec:
    def __init__(self, max_episode_steps):
        self.max_episode_steps = max_episode_steps
        self.id = "foo"


class CustomEnv(gym.Env):
    metadata = {"render.modes": ["human"]}

    def __init__(self, env_config):
        super(CustomEnv, self).__init__()
        self.simulation = PhysicSimulation()
        self.WIDTH = self.simulation.WIDTH
        self.HEIGHT = self.simulation.HEIGHT
        # One continuous action per pixel: 720 * 1280 = 921,600 dimensions.
        self.action_space = spaces.Box(low=0, high=1, shape=(int(self.HEIGHT * self.WIDTH),))
        self.observation_space = spaces.Box(
            low=-1e-6, high=1, shape=(self.HEIGHT, self.WIDTH, 4), dtype=np.float32
        )
        self.spec = Spec(self.simulation.max)

    def step(self, action):
        self.simulation.simulate(action)
        observation = self.simulation.observe()
        reward = 0
        done = self.spec.max_episode_steps <= self.simulation.count
        print(np.isnan(observation).any())  # always prints False
        return observation, reward, done, {}

    def reset(self):
        self.simulation.reset()
        return self.simulation.observe()


ray.init(num_gpus=4)


def train_ppo_model():
    algo = ppo.PPO(
        env=CustomEnv,
        config={
            "framework": "torch",
            "num_envs_per_worker": 1,
            "num_workers": 4,
            "num_gpus_per_worker": 1,
            "evaluation_interval": 1,
            "rollout_fragment_length": 10,
            "train_batch_size": 40,
            "sgd_minibatch_size": 40,
            "model": {
                "vf_share_layers": True,
                "conv_filters": [
                    [16, [24, 48], [21, 36]],
                    [32, [6, 6], 4],
                    [256, [9, 9], 1],
                ],
            },
        },
    )
    algo.train()


train_ppo_model()
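
For context, the tensor shape in the traceback follows directly from this config: the Box action space has 720 * 1280 = 921,600 dimensions, the diagonal-Gaussian action distribution needs a mean and a log_std per dimension (hence the SplitBackward0 in the traceback), and train_batch_size=40 gives the (40, 921600) loc tensor:

HEIGHT, WIDTH = 720, 1280
ACTION_DIMS = HEIGHT * WIDTH          # 921_600 continuous action dimensions
TRAIN_BATCH_SIZE = 40

# The policy head outputs 2 * ACTION_DIMS values per sample (mean and log_std),
# which are then split into two tensors of this shape:
loc_shape = (TRAIN_BATCH_SIZE, ACTION_DIMS)
print(loc_shape)                      # (40, 921600), matching the error message

So the last layer of the policy has to produce roughly 1.8 million outputs per sample, which is consistent with the problem disappearing when the action space has 10 times fewer dimensions.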

Issue Severity

High: It blocks me from completing my task.

AJSVB commented on Sep 04 '22

Same problem. It occurred at step 55M with PPO. All of the model parameters became NaN:

for p in algo.get_policy().model.named_parameters():
    print(p)

('_logits._model.0.weight', Parameter containing:
tensor([[nan, nan, nan,  ...
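
A compact way to flag this automatically instead of scanning the printout (just a sketch, using the same algo handle):

import torch

nan_params = [
    name
    for name, p in algo.get_policy().model.named_parameters()
    if torch.isnan(p).any()
]
print("parameters containing NaN:", nan_params or "none")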

siyarvurucu commented on Dec 17 '22

Exact same problem here, also with PPO. Any clues? The code works fine with smaller state spaces.

brendk commented on Jan 25 '23

I had the same issue and found I was giving NaN rewards. Once I fixed this, the problem was resolved.
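
For anyone else debugging this, a cheap guard inside the environment's step() catches it early (a sketch; _inner_step is a hypothetical stand-in for whatever your env already does):

import numpy as np

def step(self, action):
    # hypothetical: your existing transition logic
    observation, reward, done, info = self._inner_step(action)
    # Fail fast instead of letting a NaN/Inf reward silently poison the PPO update:
    assert np.isfinite(reward), f"non-finite reward: {reward}"
    assert np.isfinite(observation).all(), "non-finite values in observation"
    return observation, reward, done, info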

jaredvann commented on Mar 23 '23

Any solution? I have the same problem. The state space is not that large: a flattened tensor of 15 elements. I am using RLlib 2.4.0 and torch 2.0.

szkLaszlo commented on Jun 14 '23

Same problem. Hard to debug. Any solution? I need help!

ValueError: Expected parameter logits (Tensor of shape (1, 19)) of distribution Categorical(logits: torch.Size([1, 19])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values: tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])

JeremyLinky commented on Jun 26 '23

I have come across this error, and in my case it meant exploding gradients. I changed the reward function so that it does not have sharp jumps and also set a gradient clip; this solved my problem. You could also try reducing the learning rate. I see that in your case the reward is always zero, but it is still worth revisiting the hyperparameters. This may help the commenters above rather than the author, since the author seems to have already solved the problem. If so, it would be good if he posts the solution and closes the issue. (@AJSVB)
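
Concretely, with the dict-style config from the reproduction script, that would look roughly like this (the values are illustrative, not tuned):

config = {
    "framework": "torch",
    "grad_clip": 0.5,   # clip the global gradient norm to tame exploding gradients
    "lr": 1e-5,         # smaller learning rate than RLlib's PPO default
    # ... plus the rest of the settings from the reproduction script ...
}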

koshachya-myata commented on Jul 20 '23

I am also getting the same error.

man2machine commented on Jul 21 '24