
TimeSeriesSVC prediction is very slow

Open maz369 opened this issue 4 years ago • 6 comments

I have been working with the library and recently found that TimeSeriesSVC().predict runs very slowly and requires a huge amount of memory. Is there a way around this issue? I am trying to make 100K predictions on 1D time series (each series is shorter than 100 values), and it requires more than 100 GB of memory and takes multiple days to produce a result.

Thank you

maz369 avatar Sep 12 '20 04:09 maz369

Hi @maz369

SVMs typically work with kernel functions internally, which compute the similarity between each pair of samples. In your case, this comes down to a 100_000 x 100_000 matrix.

Nevertheless, we should probably look into a more memory-efficient representation of these pairwise similarities (e.g. by using sparse matrices).
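To give a rough sense of scale (back-of-the-envelope, assuming a dense float64 Gram matrix and ignoring any intermediate copies):

```python
# Memory needed for a dense float64 Gram matrix over 100_000 samples
n_samples = 100_000
bytes_per_float64 = 8
gram_bytes = n_samples * n_samples * bytes_per_float64
print(f"{gram_bytes / 1e9:.0f} GB")  # -> 80 GB for the matrix alone
```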

GillesVandewiele avatar Sep 12 '20 06:09 GillesVandewiele

Thank you for the explanation. It makes sense now.

maz369 avatar Sep 12 '20 09:09 maz369

You're most welcome. I will reopen the issue for now, as I do believe it should be possible (in the future) to fit a dataset of 100K time series using more memory-efficient data structures (perhaps taking a cue from how sklearn handles larger datasets with its SVMs).

EDIT: Based on the doc page of sklearn.svm.SVC, it seems they advise either using no kernel (LinearSVC) or the Nystroem kernel approximation for large datasets:

The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using sklearn.svm.LinearSVC or sklearn.linear_model.SGDClassifier instead, possibly after a sklearn.kernel_approximation.Nystroem transformer.
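For the "no kernel" route, a minimal sketch could look something like this (shapes and data are purely illustrative, and note that flattening the series gives up the time-warping invariance you get from the GAK kernel in TimeSeriesSVC):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# X has shape (n_ts, sz, d) as in tslearn; y has shape (n_ts,)
X = np.random.randn(1000, 100, 1)      # placeholder data
y = np.random.randint(0, 2, size=1000)

# Flatten each series into a single feature vector so sklearn can use it directly
X_flat = X.reshape(len(X), -1)

clf = make_pipeline(StandardScaler(), LinearSVC(dual=False))
clf.fit(X_flat, y)
preds = clf.predict(X_flat)
```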

GillesVandewiele avatar Sep 12 '20 09:09 GillesVandewiele

BTW, it's not only the SVM that needs to be improved: DTW also runs out of memory if the input data has more than 100k samples. A sparse matrix would be a direct solution.

PercyLau avatar Oct 10 '20 08:10 PercyLau

Have you perhaps found a way to speed up the training? I need to train on a dataset with over 100k samples, and it's taking forever. Even computing the silhouette score for 10k samples takes forever.

StankovskiA avatar Mar 08 '22 08:03 StankovskiA

Perhaps Kernel Approximation could speed it up: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.kernel_approximation (Nystroem seems most relevant)
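A rough sketch of how that might be combined with tslearn's GAK kernel, assuming tslearn.metrics.gak and purely illustrative values for sigma and n_components:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC
from tslearn.metrics import gak

# X has shape (n_ts, sz, 1); flatten so each sample is a single row for sklearn
X = np.random.randn(2000, 100, 1)      # placeholder data
y = np.random.randint(0, 2, size=2000)
X_flat = X.reshape(len(X), -1)

def gak_kernel(x, z, sigma=1.0):
    # Nystroem calls this on pairs of flattened rows; reshape back to (sz, 1) series
    return gak(x.reshape(-1, 1), z.reshape(-1, 1), sigma=sigma)

# Approximate the GAK Gram matrix with a low-rank feature map,
# then fit a linear SVM in that approximate feature space
feature_map = Nystroem(kernel=gak_kernel, kernel_params={"sigma": 1.0},
                       n_components=300, random_state=0)
X_features = feature_map.fit_transform(X_flat)

clf = LinearSVC(dual=False).fit(X_features, y)
preds = clf.predict(feature_map.transform(X_flat))
```

The idea is that the Nystroem map only evaluates the kernel between each sample and the n_components landmark samples, so the number of kernel evaluations grows roughly as n * n_components rather than n², at the cost of working with an approximation of the full kernel.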

GillesVandewiele avatar Mar 08 '22 09:03 GillesVandewiele