spark-tsne
Allows large data?
Hello
I currently use sklearn's TSNE, and it is not very memory friendly. I wonder how this project compares to it in terms of the number of rows it can handle. Thanks.
That was the hope, but then I found that I needed a scalable kNN implementation, which diverted me to work on https://github.com/saurfang/spark-knn. Unfortunately, I no longer have time to pursue this project. However, I am happy to answer any questions or review any contributions.
Curious, has anyone been able to run this on large datasets? I was wondering what issues you ran into and what the approximate run times were.
I am using a 3GB dataset with 100 features, and so far I have had to update the following properties with new values:
"spark.rpc.askTimeout=1000" "spark.akka.frameSize=256" "spark.driver.maxResultSize=2G"
to fix the exceptions I ran into. Also, the driver and executors need to have lots of memory; I am using 10G for each (with 12 executors), and the t-SNE is still running after about 14 hours...
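For anyone trying to reproduce this, here is a rough sketch of how those properties can be set programmatically via SparkConf. The property names are the ones listed above; the memory and executor values are just the ones from my run and will likely need tuning for other clusters.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: values below come from my run and are not a recommendation.
val conf = new SparkConf()
  .setAppName("spark-tsne-large-run")
  .set("spark.rpc.askTimeout", "1000")
  .set("spark.akka.frameSize", "256")
  .set("spark.driver.maxResultSize", "2G")
  .set("spark.executor.memory", "10G")
  .set("spark.executor.instances", "12")
// Note: driver memory generally has to be set at launch time
// (e.g. spark-submit --driver-memory 10G), not via SparkConf,
// because the driver JVM is already running by this point.

val sc = new SparkContext(conf)
```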
I am using the same approach as shown in the MNIST.scala example (com/github/saurfang/spark/tsne/examples/MNIST.scala).
Any thoughts/ideas on speeding this up?
Regards, Rajesh