Training SegFormer model not working (goes through the notebook, but model loss becomes NaN) on a dataset I created (stuck for a week or so)
When I try to train a SegFormer model with this notebook, changing the variable ds to one of the contrail datasets I have been uploading to Hugging Face, such as this one, the model's loss becomes NaN (and perhaps (?) it sometimes crashes after the first training epoch).
This does not occur when training on Segments.ai's sidewalks dataset. It may have something to do with differences in my segmentation bitmaps, or with the duckdb files (the duckdb files seem to be formatted differently in the sidewalks dataset than in my contrails dataset).
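For reference, this is roughly how I have been inspecting the bitmaps (a minimal sketch; the file path is a placeholder):

```python
import numpy as np
from PIL import Image

# Sketch: list the distinct pixel values in one segmentation bitmap.
# For this notebook these should be consecutive class ids
# 0, 1, ..., num_labels - 1.
mask = np.array(Image.open("path/to/label_bitmap.png"))  # placeholder path
print(np.unique(mask))
```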
Why does this occur?
(I obtained the contrails images from this competition's dataset.)
@sayakpaul
Try lowering the learning rate.
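For example, something like this (a sketch; the notebook's actual optimizer setup may differ, and the original learning-rate value is assumed here):

```python
import tensorflow as tf

# Recompile the notebook's `model` with a smaller learning rate.
# 6e-5 is a common SegFormer fine-tuning default; try 10x lower.
optimizer = tf.keras.optimizers.Adam(learning_rate=6e-6)
model.compile(optimizer=optimizer)
```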
The model's loss still becomes NaN even with a learning rate 10x (even 1000x) lower than the one originally in the notebook. A few errors also occur; a screenshot and the error text are below.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[30], line 1
----> 1 model.fit(
2 train_set,
3 validation_data=val_set,
4 callbacks=callbacks,
5 epochs=epochs,
6 )
File ~/jupyter/miniconda3/envs/tf3.10new/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
67 filtered_tb = _process_traceback_frames(e.__traceback__)
68 # To get the full stack trace, call:
69 # `tf.debugging.disable_traceback_filtering()`
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb
File ~/jupyter/miniconda3/envs/tf3.10new/lib/python3.10/site-packages/transformers/keras_callbacks.py:256, in KerasMetricCallback.on_epoch_end(self, epoch, logs)
253 all_preds = self._postprocess_predictions_or_labels(prediction_list)
254 all_labels = self._postprocess_predictions_or_labels(label_list)
--> 256 metric_output = self.metric_fn((all_preds, all_labels))
257 if not isinstance(metric_output, dict):
258 raise TypeError(
259 f"metric_fn should return a dict mapping metric names to values but instead returned {metric_output}"
260 )
Cell In[27], line 29
25 per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
26 per_category_iou = metrics.pop("per_category_iou").tolist()
28 metrics.update(
---> 29 {f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)}
30 )
31 metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
32 return {"val_" + k: v for k, v in metrics.items()}
Cell In[27], line 29
25 per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
26 per_category_iou = metrics.pop("per_category_iou").tolist()
28 metrics.update(
---> 29 {f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)}
30 )
31 metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
32 return {"val_" + k: v for k, v in metrics.items()}
KeyError: 2
I think I figured it out: the labels file I used had pixel value 255 for contrails, pixel value 1 for another ("filler") class, and pixel value 0 for unlabeled. The class ids need to be consecutive integers starting at 0 (the pattern "0 1 2 3 ..."), so contrails should have been pixel value 2. I assume that also explains the KeyError: 2 above: id2label has keys 0, 1, and 255 rather than 0, 1, 2, while the per-category metric arrays are indexed by consecutive ids.
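In case anyone hits the same thing, the remapping itself is short per file (a sketch; the file name is a placeholder):

```python
import numpy as np
from PIL import Image

# Remap label pixel values so class ids are consecutive (0, 1, 2, ...):
# 0 = unlabeled, 1 = filler, and contrails move from 255 to 2.
mask = np.array(Image.open("label.png"))  # placeholder path
mask[mask == 255] = 2
Image.fromarray(mask.astype(np.uint8)).save("label.png")
```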
Sort of "closed," but this is a very dumb issue. Is there any way to guard against it in the future? It shouldn't take too long to change some bits of code, especially since I was stuck on this for a week and a half.
@sayakpaul