Training seems to crash occasionally

Open andreped opened this issue 3 years ago • 3 comments

When training RL models using sapai-gym, different errors tend to occur.

I have tried to use try/except blocks, but the problem is that if this happens, training with stable-baselines3 crashes, and we have to start all over again.

We should therefore either: 1) fix what is bugged in sapai/sapai-gym, or 2) add a wrapper that catches when the environment fails and tries to generate a fresh state (if possible).
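For option 2, a minimal sketch of what such a wrapper could look like (purely illustrative: the class name is made up and the old gym 4-tuple step() API is assumed; this is not tied to the actual sapai-gym internals):

import gym

class CrashTolerantEnv(gym.Wrapper):
    """Sketch: if the wrapped sapai-gym env raises during step(),
    end the episode and reset instead of crashing the whole training run."""

    def step(self, action):
        try:
            return self.env.step(action)
        except Exception as e:
            print(f"Environment raised {e!r}; terminating episode and resetting.")
            obs = self.env.reset()
            # Hand SB3 a terminal transition with zero reward so training can continue.
            return obs, 0.0, True, {"env_error": str(e)}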

andreped avatar Aug 08 '22 16:08 andreped

I've added a temporary fix for this, which essentially catches when this happens, and restarts training from the previous state, keeping all model history and whatnot.

Need a proper fix for this in sapai/sapai-gym.
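Roughly, the temporary fix boils down to something like the sketch below (illustrative only, not the actual train_agent.py code; make_env, latest_checkpoint_path, total_timesteps and checkpoint_callback are hypothetical placeholders): keep retrying model.learn(), and on failure reload the most recent checkpoint and continue from where it left off.

from sb3_contrib import MaskablePPO

# Sketch: retry loop around training; resume from the last checkpoint on failure.
model = MaskablePPO("MlpPolicy", make_env(), verbose=1)
steps_left = total_timesteps
while steps_left > 0:
    try:
        model.learn(total_timesteps=steps_left, callback=checkpoint_callback,
                    reset_num_timesteps=False)
        break
    except Exception as e:
        print(f"Training crashed ({e!r}); reloading last checkpoint and resuming.")
        model = MaskablePPO.load(latest_checkpoint_path(), env=make_env())
        steps_left = total_timesteps - model.num_timesteps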

andreped avatar Aug 18 '22 22:08 andreped

As I assumed all errors were coming from sapai-gym, I added a fix to catch all errors happening there: https://github.com/andreped/sapai-gym/commit/7443f36944466316efbb5f0c35d91593cc7a50e5

However, to my surprise, when running a regular training run (now without the try/except loop in the main training script train_agent.py), I got an error from within sb3. This is more challenging to solve, and I am not really sure what is causing it. See the traceback below, which appeared after about 250k steps:

Traceback (most recent call last):
  File ".\main.py", line 28, in <module>
    train_with_masks(ret)
  File "C:\Users\andrp\workspace\super-ml-pets\src\train_agent.py", line 60, in train_with_masks
    model.learn(total_timesteps=ret.nb_steps, callback=checkpoint_callback)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 579, in learn
    self.train()
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 439, in train
    values, log_prob, entropy = self.policy.evaluate_actions(
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\common\maskable\policies.py", line 280, in evaluate_actions
    distribution.apply_masking(action_masks)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 152, in apply_masking
    self.distribution.apply_masking(masks)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 62, in apply_masking
    super().__init__(logits=logits)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\torch\distributions\categorical.py", line 64, in __init__
    super(Categorical, self).__init__(batch_shape, validate_args=validate_args)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\torch\distributions\distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter probs (Tensor of shape (64, 213)) of distribution MaskableCategorical(probs: torch.Size([64, 213]), logits: torch.Size([64, 213])) to satisfy the constraint Simplex(), but found invalid values:
tensor([[4.9590e-11, 2.1976e-10, 6.1887e-01,  ..., 3.3524e-13, 4.5890e-12,
         5.3164e-14],
        [1.4266e-06, 8.7648e-10, 1.3233e-06,  ..., 1.5695e-07, 2.9451e-08,
         1.5212e-07],
        [2.2623e-06, 2.3994e-09, 5.3787e-07,  ..., 3.9735e-08, 2.8777e-09,
         2.6170e-08],
        ...,
        [1.6828e-12, 4.9032e-04, 9.5983e-13,  ..., 1.7402e-13, 1.9223e-13,
         5.6725e-14],
        [4.7819e-10, 7.7589e-03, 7.8509e-18,  ..., 6.4911e-11, 8.8994e-12,
         8.3013e-11],
        [3.6789e-08, 1.2760e-07, 4.7924e-16,  ..., 8.6682e-09, 8.6489e-10,
         3.7913e-08]], grad_fn=<SoftmaxBackward0>)
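The invalid values in the probs tensor often mean that NaN/inf logits came out of the policy network, and one common source is a NaN/inf slipping into the observations or rewards. A possible way to narrow this down (just a suggestion, not something the training script currently does) is stable-baselines3's VecCheckNan wrapper, which raises the moment a bad value enters the pipeline:

from stable_baselines3.common.vec_env import DummyVecEnv, VecCheckNan

# Sketch (make_env is a hypothetical factory returning the sapai-gym env):
# VecCheckNan raises a ValueError as soon as a NaN/inf shows up in the
# observations, rewards or actions, which pinpoints where the bad value enters.
venv = DummyVecEnv([make_env])
venv = VecCheckNan(venv, raise_exception=True)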

andreped avatar Aug 20 '22 18:08 andreped

A random exception seems to happen after training for thousands of steps:

Exception: get_idx < pet-hedgehog 10-1 status-honey-bee 2-1 > not found

What is causing this?

andreped avatar Aug 21 '22 10:08 andreped