Training SegFormer model not working (goes through notebook, but model loss becomes nan) on dataset I created (stuck for a week or so)

Open realharryhero opened this issue 1 year ago • 5 comments

When I train a SegFormer model with this notebook, after changing the variable ds to one of the contrails datasets I have been uploading to Hugging Face (such as this one), the model's loss turns to nan (and perhaps (?) it sometimes crashes after training the first epoch).

This does not occur when training on Segments.ai's sidewalk dataset. It may have something to do with differences in my segmentation bitmaps or with the duckdb files (the duckdb files seem to be formatted differently for the sidewalk dataset than for my contrails dataset).

Why does this occur?

(I obtained the contrails images from this competition's dataset.)

realharryhero avatar Dec 30 '23 00:12 realharryhero

@sayakpaul

realharryhero avatar Dec 30 '23 00:12 realharryhero

Try lowering the learning rate.

sayakpaul avatar Dec 30 '23 03:12 sayakpaul

The model's loss still becomes nan even with a learning rate 10x (1000x?) lower than the one originally in the notebook. A few errors also occur; the traceback text and a screenshot are below.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[30], line 1
----> 1 model.fit(
      2     train_set,
      3     validation_data=val_set,
      4     callbacks=callbacks,
      5     epochs=epochs,
      6 )

File ~/jupyter/miniconda3/envs/tf3.10new/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

File ~/jupyter/miniconda3/envs/tf3.10new/lib/python3.10/site-packages/transformers/keras_callbacks.py:256, in KerasMetricCallback.on_epoch_end(self, epoch, logs)
    253 all_preds = self._postprocess_predictions_or_labels(prediction_list)
    254 all_labels = self._postprocess_predictions_or_labels(label_list)
--> 256 metric_output = self.metric_fn((all_preds, all_labels))
    257 if not isinstance(metric_output, dict):
    258     raise TypeError(
    259         f"metric_fn should return a dict mapping metric names to values but instead returned {metric_output}"
    260     )

Cell In[27], line 29
     25 per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
     26 per_category_iou = metrics.pop("per_category_iou").tolist()
     28 metrics.update(
---> 29     {f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)}
     30 )
     31 metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
     32 return {"val_" + k: v for k, v in metrics.items()}

Cell In[27], line 29
     25 per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
     26 per_category_iou = metrics.pop("per_category_iou").tolist()
     28 metrics.update(
---> 29     {f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)}
     30 )
     31 metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
     32 return {"val_" + k: v for k, v in metrics.items()}

KeyError: 2
[Screenshot: SegFormer model not training]

realharryhero avatar Dec 30 '23 15:12 realharryhero

I think I figured it out: the labels file I used had pixel value 255 for contrails, pixel value 1 for another ("filler") class, and pixel value 0 for unlabeled. But I think I needed pixel value 2 for contrails, so that the class ids follow the contiguous pattern "0 1 2 3 ...".
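
For anyone hitting the same thing, here is a minimal sketch of the remapping described above, assuming the masks are uint8 NumPy arrays. The names `RAW_TO_ID` and `remap_mask` are hypothetical and not part of the notebook; the mapping only mirrors the pixel values mentioned in this comment.

```python
import numpy as np

# Hypothetical mapping from the raw pixel values described above
# (0 = unlabeled, 1 = filler, 255 = contrails) to contiguous class ids.
RAW_TO_ID = {0: 0, 1: 1, 255: 2}


def remap_mask(mask: np.ndarray, raw_to_id: dict) -> np.ndarray:
    """Return a new mask with raw pixel values replaced by contiguous ids."""
    # Fail loudly if the mask contains a value the mapping does not cover.
    unexpected = set(np.unique(mask).tolist()) - set(raw_to_id)
    if unexpected:
        raise ValueError(f"Unmapped pixel values in mask: {sorted(unexpected)}")
    # Lookup table indexed by raw value; assumes uint8 masks (values 0..255).
    lut = np.zeros(256, dtype=np.uint8)
    for raw, new in raw_to_id.items():
        lut[raw] = new
    return lut[mask]


raw = np.array([[0, 1, 255], [255, 0, 1]], dtype=np.uint8)
fixed = remap_mask(raw, RAW_TO_ID)
```

With this mapping, the keys of id2label would be 0, 1, and 2, matching the values that actually appear in the remapped masks.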

Sort of "closed," but this is a very dumb issue. Is there any way to guard against it in the future? It shouldn't take long to change a few bits of code, especially as I was stuck on this for a week and a half.
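
One possible guard, sketched here as an assumption rather than anything the notebook or transformers currently does, is a check that runs before training and fails early when the mask values and id2label disagree. The function name `check_label_ids` is hypothetical.

```python
import numpy as np


def check_label_ids(mask: np.ndarray, id2label: dict) -> None:
    """Raise ValueError if id2label is not contiguous or the mask has unknown ids."""
    # The metric code in the traceback indexes id2label with enumerate(),
    # so the keys need to be exactly 0, 1, ..., N-1.
    if sorted(id2label) != list(range(len(id2label))):
        raise ValueError(
            f"id2label keys must be 0..{len(id2label) - 1}, got {sorted(id2label)}"
        )
    # Every pixel value in the mask must be a known class id.
    unknown = sorted(set(np.unique(mask).tolist()) - set(id2label))
    if unknown:
        raise ValueError(f"Mask contains ids missing from id2label: {unknown}")


id2label = {0: "unlabeled", 1: "filler", 2: "contrails"}
check_label_ids(np.array([[0, 1, 2]], dtype=np.uint8), id2label)  # passes silently
```

Running this over a handful of masks before calling model.fit would have flagged the stray 255 value immediately, instead of it surfacing as a nan loss or a KeyError at the end of the first epoch.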

realharryhero avatar Dec 31 '23 16:12 realharryhero

@sayakpaul

realharryhero avatar Dec 31 '23 16:12 realharryhero