
NAN loss for LearningShapelet with variable length time-series while training

Open Wwwwei opened this issue 4 years ago • 6 comments

Hi all, I am trying to train a LearningShapelet model with variable length time-series (refer to https://tslearn.readthedocs.io/en/latest/variablelength.html).

from tslearn.utils import to_time_series_dataset
from tslearn.shapelets import LearningShapelets

X = to_time_series_dataset([[1, 2, 3, 4], [1, 2, 3], [2, 5, 6, 7, 8, 9]])
y = [0, 0, 1]
clf = LearningShapelets(n_shapelets_per_size={3: 1}, verbose=1, max_iter=10)
clf.fit(X, y)

However, I find that the loss turns into 'nan' during training. Any idea why this is happening? Thank you.

Epoch 1/10
1/1 [==============================] - 0s 372ms/step - loss: 0.8146 - binary_accuracy: 0.6667 - binary_crossentropy: 0.8146
Epoch 2/10
1/1 [==============================] - 0s 997us/step - loss: nan - binary_accuracy: 0.6667 - binary_crossentropy: nan
Epoch 3/10
1/1 [==============================] - 0s 2ms/step - loss: nan - binary_accuracy: 0.6667 - binary_crossentropy: nan
Epoch 4/10
1/1 [==============================] - 0s 2ms/step - loss: nan - binary_accuracy: 0.6667 - binary_crossentropy: nan
Epoch 5/10
1/1 [==============================] - 0s 2ms/step - loss: nan - binary_accuracy: 0.6667 - binary_crossentropy: nan
Epoch 6/10
1/1 [==============================] - 0s 997us/step - loss: nan - binary_accuracy: 0.6667 - binary_crossentropy: nan
Epoch 7/10
1/1 [==============================] - 0s 997us/step - loss: nan - binary_accuracy: 0.6667 - binary_crossentropy: nan
Epoch 8/10
1/1 [==============================] - 0s 2ms/step - loss: nan - binary_accuracy: 0.6667 - binary_crossentropy: nan
Epoch 9/10
1/1 [==============================] - 0s 998us/step - loss: nan - binary_accuracy: 0.6667 - binary_crossentropy: nan
Epoch 10/10
1/1 [==============================] - 0s 2ms/step - loss: nan - binary_accuracy: 0.6667 - binary_crossentropy: nan

Wwwwei avatar Apr 13 '21 07:04 Wwwwei

Hello,

Have you tried normalizing your data? These nans can be caused by exploding/vanishing gradients.
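For reference, per-series min-max scaling (similar in spirit to tslearn's TimeSeriesScalerMinMax) can be sketched in plain numpy; this version uses nanmin/nanmax so that NaN padding in a variable-length dataset is ignored rather than poisoning the scale. The function name is illustrative, not part of any library API.

```python
import numpy as np

def minmax_scale_per_series(X):
    """Scale each series to [0, 1] independently, ignoring NaN
    padding (illustrative sketch, not the tslearn implementation)."""
    X = np.asarray(X, dtype=float)
    lo = np.nanmin(X, axis=(1, 2), keepdims=True)
    hi = np.nanmax(X, axis=(1, 2), keepdims=True)
    return (X - lo) / (hi - lo)

# Padded dataset: shape (n_series, max_len, 1), NaN where shorter
X = np.array([[[1.], [2.], [3.], [np.nan]],
              [[2.], [4.], [6.], [8.]]])
X_scaled = minmax_scale_per_series(X)
# First series spans [1, 3] -> scaled to [0, 1]; NaN stays NaN
```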

GillesVandewiele avatar Apr 13 '21 07:04 GillesVandewiele

Thanks for the prompt reply. I tried TimeSeriesScalerMinMax(), but it does not fix the problem. I wonder if this is caused by to_time_series_dataset, which pads all the time series to the same length with NaN:

X = to_time_series_dataset([[1, 2, 3, 4], [1, 2, 3], [2, 5, 6, 7, 8, 9]])

After this, X is:

array([[[ 1.], [ 2.], [ 3.], [ 4.], [nan], [nan]],
       [[ 1.], [ 2.], [ 3.], [nan], [nan], [nan]],
       [[ 2.], [ 5.], [ 6.], [ 7.], [ 8.], [ 9.]]])

Wwwwei avatar Apr 13 '21 08:04 Wwwwei

Hi @Wwwwei

I would guess that if the problem came from the padded NaNs, it would occur from the first epoch, which is not the case, so exploding gradients are probably the cause of your problem.

rtavenar avatar Apr 13 '21 08:04 rtavenar

Hello @rtavenar, I appreciate your response. I tried gradient clipping as you suggested, but the bug remains. When I change all the time series to a fixed length, or fill the NaNs from to_time_series_dataset with zeros, the model works, just like the simple demo in https://tslearn.readthedocs.io/en/latest/variablelength.html. Maybe something goes wrong when handling variable-length time series?
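The zero-fill workaround described here can be sketched as follows (a minimal numpy sketch, not a tslearn API; the function name is hypothetical). Note that zero-filling changes the semantics of the data: the model then treats the padding as real zero-valued observations, which can bias the learned shapelets, so this is a workaround rather than a fix.

```python
import numpy as np

def zero_fill_padding(X):
    """Replace NaN padding with 0.0 so downstream gradient
    computations never see NaNs. Caution: padding is then
    indistinguishable from genuine zero values."""
    return np.where(np.isnan(X), 0.0, X)

X = np.array([[[1.], [2.], [3.], [np.nan]],
              [[2.], [5.], [6.], [7.]]])
X_filled = zero_fill_padding(X)
```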

Wwwwei avatar Apr 13 '21 09:04 Wwwwei

Hi @GillesVandewiele @rtavenar Sorry to bother you again. I seem to have found the reason.

# source code in shapelets.py
class LocalSquaredDistanceLayer(Layer):
    # ……
    def call(self, x, **kwargs):
        # (x - y)^2 = x^2 + y^2 - 2 * x * y
        x_sq = K.expand_dims(K.sum(x ** 2, axis=2), axis=-1)
        y_sq = K.reshape(K.sum(self.kernel ** 2, axis=1),
                         (1, 1, self.n_shapelets))
        xy = K.dot(x, K.transpose(self.kernel))
        return (x_sq + y_sq - 2 * xy) / K.int_shape(self.kernel)[1]
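For clarity, the algebraic expansion (x - y)^2 = x^2 + y^2 - 2xy used by this layer can be checked against the direct pairwise computation in plain numpy (an illustrative sketch with made-up shapes, not the layer itself):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))       # 5 sliding windows of length 3
kernel = rng.normal(size=(2, 3))  # 2 shapelets of length 3

# Direct computation: mean squared difference per (window, shapelet)
direct = ((x[:, None, :] - kernel[None, :, :]) ** 2).mean(axis=2)

# Expanded form, as in the layer: (x^2 + y^2 - 2xy) / shapelet_length
x_sq = (x ** 2).sum(axis=1, keepdims=True)
y_sq = (kernel ** 2).sum(axis=1)[None, :]
xy = x @ kernel.T
expanded = (x_sq + y_sq - 2 * xy) / kernel.shape[1]

assert np.allclose(direct, expanded)
```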

We take the derivative with respect to y (i.e., the shapelet variable): d(x - y)^2 / dy = 2y - 2x. So when there is a NaN in x (i.e., the input data), the gradient is also NaN. That is why the first epoch works but the later ones do not: after the first backpropagation, the weights become NaN. Is my understanding correct?
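This NaN propagation can be demonstrated with a tiny numpy sketch of a single gradient-descent step (illustrative only, not the Keras training loop): any NaN in the input makes the corresponding gradient entry NaN, and the update then writes NaN into the weight.

```python
import numpy as np

# x: a window containing NaN padding; y: a shapelet of the same length
x = np.array([1.0, 2.0, np.nan])
y = np.array([0.5, 0.5, 0.5])

# Analytic gradient of (x - y)^2 with respect to y: 2 * (y - x)
grad = 2 * (y - x)

# One gradient-descent step: the NaN in x poisons the third weight
y_updated = y - 0.1 * grad
```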

Wwwwei avatar Apr 26 '21 11:04 Wwwwei

Hey @GillesVandewiele @rtavenar I was trying out LearningShapelets on variable-length time series data and ran into this error too. I got NaNs right from the first epoch. I tried using TimeSeriesScalerMinMax() to normalize the data, and it didn't make a difference.

I also ran the unit test for variable length, and the results are NaN from the second epoch onward. Adding normalization makes no difference.

# Test variable-length
y = [0, 1]
time_series = to_time_series_dataset([[1, 2, 3, 4, 5], [3, 2, 1]])
time_series = TimeSeriesScalerMinMax().fit_transform(time_series)
clf = LearningShapelets(n_shapelets_per_size={3: 1},
                        max_iter=5,
                        verbose=1,
                        random_state=0)
clf.fit(time_series, y)

Output -

Epoch 1/5
1/1 [==============================] - 1s 572ms/step - loss: 0.6930 - binary_accuracy: 0.5000 - binary_crossentropy: 0.6930
Epoch 2/5
1/1 [==============================] - 0s 5ms/step - loss: nan - binary_accuracy: 0.5000 - binary_crossentropy: nan
Epoch 3/5
1/1 [==============================] - 0s 5ms/step - loss: nan - binary_accuracy: 0.5000 - binary_crossentropy: nan
Epoch 4/5
1/1 [==============================] - 0s 5ms/step - loss: nan - binary_accuracy: 0.5000 - binary_crossentropy: nan
Epoch 5/5
1/1 [==============================] - 0s 5ms/step - loss: nan - binary_accuracy: 0.5000 - binary_crossentropy: nan

Do you think @Wwwwei's explanation above is correct? Even if the issue turns out to be different, I think it should be addressed, if only with a note on the possible causes. If I can get some pointers, I can look further into it and create a PR.
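One possible direction for a fix (a hypothetical sketch, not tslearn's actual implementation) is to mask out NaN positions when computing the window-to-shapelet distance, averaging only over valid entries:

```python
import numpy as np

def masked_mean_sq_dist(window, shapelet):
    """Mean squared distance between a window and a shapelet,
    ignoring NaN positions in the window. Hypothetical NaN-aware
    alternative to the plain (x - y)^2 computation."""
    valid = ~np.isnan(window)
    if not valid.any():
        return np.nan  # window is all padding
    diff = window[valid] - shapelet[valid]
    return float((diff ** 2).mean())

window = np.array([1.0, 2.0, np.nan])
shapelet = np.array([1.0, 1.0, 1.0])
d = masked_mean_sq_dist(window, shapelet)  # mean of (0^2, 1^2) = 0.5
```

Doing the same thing inside the Keras layer would require carrying a validity mask through the tensor computation, which is likely where the real work of a PR would be.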

Div12345 avatar Dec 21 '22 04:12 Div12345