spark-tsne
Allows large data?
Hello
I currently use sklearn's TSNE, and it is not very memory friendly. I wonder how this project compares to it in terms of the number of rows it can handle. Thanks.
That was the hope, but then I found that I needed a scalable kNN implementation, which diverted me to work on https://github.com/saurfang/spark-knn. Unfortunately, I no longer have time to pursue this project. However, I am happy to answer any questions or review any contributions.
Curious, has anyone been able to run this on large datasets? I was wondering what issues you ran into and what the approximate run times were.
I am using a 3GB dataset with 100 features, and so far I have had to update the following properties with new values:
"spark.rpc.askTimeout=1000" "spark.akka.frameSize=256" "spark.driver.maxResultSize=2G"
to fix the exceptions I ran into. Also, the driver and executors need to have lots of memory; I am using 10G for each (with 12 executors), and the t-SNE is still running after about 14 hours...
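For anyone trying to reproduce this, here is a rough sketch of how those properties can be set programmatically via SparkConf. The property names are the ones listed above; the memory and executor values are just the ones from my run and will likely need tuning for other clusters.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: values below come from my run and are not a recommendation.
val conf = new SparkConf()
  .setAppName("spark-tsne-large-run")
  .set("spark.rpc.askTimeout", "1000")
  .set("spark.akka.frameSize", "256")
  .set("spark.driver.maxResultSize", "2G")
  .set("spark.executor.memory", "10G")
  .set("spark.executor.instances", "12")
// Note: driver memory generally has to be set at launch time
// (e.g. spark-submit --driver-memory 10G), not via SparkConf,
// because the driver JVM is already running by this point.

val sc = new SparkContext(conf)
```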
I am using the same approach as shown in the MNIST.scala example (com/github/saurfang/spark/tsne/examples/MNIST.scala).
Any thoughts/ideas on speeding this up?
Regards, Rajesh