tslearn
Can we use GPU and PySpark to improve clustering time for TimeSeriesKMeans?
Dear Dev Team,
@ecederstrand @rth @rflamary @apachaves @felixdivo
Can we use GPU and PySpark to improve clustering time for TimeSeriesKMeans? I tried using n_jobs for parallel processing in Databricks, but the clustering time is the same on an 8-CPU machine and a 32-CPU machine. It clearly doesn't help.
Can you please suggest the best approach to reduce the clustering time?
Thanks, Ishwar Sukheja
Hi Team,
Any updates?
Thanks, Ishwar
Hi @sukhejai, I'm not part of the dev team, but thank you for calling me here.
Have you tried monitoring the CPU usage? I know that may not be simple in Databricks, and there is overhead in the JVM layers underneath, so I'm always cautious about what happens inside it.
May I suggest that you run a test with n_jobs outside Databricks, for example in a Jupyter notebook running locally on your machine? Then open the resource monitor to double-check that the parallelization is indeed happening and that all CPUs are being used.
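Something along these lines could serve as the local test. It is only a minimal sketch, assuming tslearn is installed and that the "dtw" metric is the one in use; the helper names (`time_fits`, `make_dtw_kmeans`, `run_demo`), the synthetic data sizes, and the parameter values are all illustrative assumptions, not taken from the original run.

```python
import time


def time_fits(model_factory, X, n_jobs_values=(1, -1)):
    """Fit one fresh model per n_jobs value and return {n_jobs: seconds}."""
    timings = {}
    for n_jobs in n_jobs_values:
        model = model_factory(n_jobs)
        start = time.perf_counter()
        model.fit(X)
        timings[n_jobs] = time.perf_counter() - start
    return timings


def make_dtw_kmeans(n_jobs):
    """Hypothetical factory: a small DTW-based TimeSeriesKMeans model."""
    # Imported lazily so time_fits can be reused with other estimators.
    from tslearn.clustering import TimeSeriesKMeans

    # n_clusters, max_iter and random_state are arbitrary illustration values.
    return TimeSeriesKMeans(
        n_clusters=3,
        metric="dtw",
        max_iter=5,
        random_state=0,
        n_jobs=n_jobs,
    )


def run_demo():
    """Time the same fit with one worker and with all available cores."""
    from tslearn.generators import random_walks

    # Synthetic stand-in data: 200 univariate series of length 128.
    X = random_walks(n_ts=200, sz=128, d=1, random_state=0)
    for n_jobs, seconds in time_fits(make_dtw_kmeans, X).items():
        print(f"n_jobs={n_jobs}: {seconds:.2f} s")
```

Calling `run_demo()` in a local notebook while the resource monitor is open should show whether more than one core is busy during the fit. If the two timings come out nearly identical, that would be worth reporting here together with the tslearn version.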
Finally, it would be nice to have the CPU usage information posted here, plus a small code snippet with the example you tried to run. I'm sure that with that, the dev team will be able to narrow down the best solutions for you.
Hope it helps, I'm curious too.
Best, Anderson
Any update on this? I am currently trying to get this running on PySpark. @apachaves, can we get TimeSeriesKMeans to run on a PySpark DataFrame? The data I am working with is too big for pandas. Thanks.