tslearn

Can we use GPU and PySpark to improve clustering time for TimeSeriesKMeans?

Open sukhejai opened this issue 2 years ago • 3 comments

Dear Dev Team,

@ecederstrand @rth @rflamary @apachaves @felixdivo

Can we use a GPU and PySpark to reduce clustering time for TimeSeriesKMeans? I tried using n_jobs for parallel processing in Databricks, but clustering takes the same time on an 8-CPU machine as on a 32-CPU machine, so it clearly doesn't help.

Can you please suggest the best approach to reduce the clustering time?

Thanks, Ishwar Sukheja

sukhejai • Jul 14 '22 16:07

Hi Team,

Any updates?

Thanks, Ishwar

sukhejai • Jul 26 '22 13:07

Hi @sukhejai, I'm not part of the dev team, but thank you for tagging me here.

Have you tried monitoring the CPU usage? I know it might not be that simple in Databricks, and there is also overhead in the JVM layers underneath, so I'm always cautious about what happens inside it.

Can I suggest that you run a test with n_jobs outside Databricks, for example in a Jupyter notebook running locally on your machine? Then open the resource monitor to double-check that the parallelization is actually happening and that all CPUs are being used.

Finally, it would be nice if you could post the CPU usage information here, along with a small code snippet of the example you tried to run. I'm sure that with that the dev team will be able to narrow down the best solutions for you.
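A minimal sketch of such a local test, in case it is useful. The helper names `time_fit` and `make_tslearn_kmeans` are made up for this example and are not part of tslearn; the only library call assumed is the `TimeSeriesKMeans` constructor with its `n_clusters`, `metric`, `max_iter`, `n_jobs` and `random_state` parameters. As far as I understand, `n_jobs` only affects the cross-distance computations (for example with `metric="dtw"`), so a Euclidean run may show no difference at all.

```python
import time


def time_fit(make_model, X, n_jobs_values=(1, 2, 4)):
    """Fit one fresh model per n_jobs value and return {n_jobs: seconds}.

    `make_model` is any callable that takes n_jobs and returns an object
    with a .fit(X) method (hypothetical helper, not a tslearn function).
    """
    timings = {}
    for n_jobs in n_jobs_values:
        model = make_model(n_jobs)
        start = time.perf_counter()
        model.fit(X)
        timings[n_jobs] = time.perf_counter() - start
    return timings


def make_tslearn_kmeans(n_jobs):
    """Build a small DTW k-means model (assumes tslearn is installed)."""
    # Imported lazily so the timing helper above works without tslearn.
    from tslearn.clustering import TimeSeriesKMeans

    return TimeSeriesKMeans(
        n_clusters=3,
        metric="dtw",      # DTW is where parallel distance computation applies
        max_iter=5,        # kept small so the test finishes quickly
        n_jobs=n_jobs,
        random_state=0,
    )
```

Calling `time_fit(make_tslearn_kmeans, X)` with `X` a NumPy array of shape `(n_series, series_length, 1)` gives one wall-clock time per `n_jobs` value. If the times barely change while the resource monitor shows a single busy core, the parallelization is not kicking in.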

Hope it helps, I'm curious too.

Best, Anderson

apachaves • Jul 27 '22 06:07

Any update on this? I am currently trying to get this running on PySpark. @apachaves, can we get TimeSeriesKMeans to run on a PySpark DataFrame? The data I am working with is too big for pandas. Thanks.

justkrismanohar • Jul 25 '23 02:07