words2map
Clustering in 2D - is that the best choice?
Question to Y-hat folks: why cluster in 2D? Granted, clustering in 300D is hard :) Still, the 2D projection must add significant metric distortion. Why not a middle ground, say 5-10D? Have you tried that?
Thanks @robinlabs, that's definitely a great question.
The short answer is you're right, 2D is not necessarily an optimum. It's clearly nice for data visualization, although 3D would probably be even cooler...
In any case, I haven't yet tried full HDBSCAN clustering in 300D, so that could be interesting to try out. It would also be interesting to consider if there's some way to measure a "maximum likelihood" value for D that balances preservation of information with suppression of noise in the derived 300D vectors. t-SNE helps to reduce the noise that naturally emerges when averaging 25 completely different vectors for keywords found online...
Definitely hope to improve this, and any ideas / contributions are welcome!
HDBSCAN may very well break in 300D, but 5-10D may be reasonable: it should force less metric distortion than 2D while still suppressing quite a bit of noise. If you do try that, it would be interesting to know the results!
-- Ilya Eckstein, PhD, cofounder / CEO @ Robin Labs, www.robinlabs.com (replying by email to Lance Legel's comment of Tue, Jul 26, 2016)
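For anyone who wants to put rough numbers on the distortion question, here is a minimal sketch that compares pairwise distances before and after reducing 300D vectors to 2, 5, 10 and 50 dimensions. It makes several assumptions: the vectors are synthetic Gaussian stand-ins rather than real word2vec output, and the reduction is a plain Gaussian random projection, not the t-SNE step words2map actually uses, so the numbers only illustrate the general trend. All function names are hypothetical, and only the Python standard library is used.

```python
import math
import random


def make_vectors(count, dim, rng):
    """Synthetic stand-ins for 300D word vectors (random Gaussian, not real embeddings)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def random_projection(vectors, target_dim, rng):
    """Project vectors to target_dim with a Gaussian random matrix.

    This is a stand-in for t-SNE, chosen only because it needs no dependencies.
    The 1/sqrt(target_dim) scale keeps squared distances unbiased in expectation.
    """
    source_dim = len(vectors[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
        for _ in range(target_dim)
    ]
    return [[sum(w * x for w, x in zip(row, vec)) for row in matrix] for vec in vectors]


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair, in a fixed order."""
    return [
        math.dist(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]


def mean_relative_distortion(original, projected):
    """Average of |d_projected - d_original| / d_original over all pairs."""
    d_orig = pairwise_distances(original)
    d_proj = pairwise_distances(projected)
    return sum(abs(p - o) / o for o, p in zip(d_orig, d_proj)) / len(d_orig)


def distortion_by_dimension(vectors, dims, trials=5, seed=0):
    """Mean relative distortion per target dimension, averaged over several projections."""
    rng = random.Random(seed)
    results = {}
    for d in dims:
        scores = [
            mean_relative_distortion(vectors, random_projection(vectors, d, rng))
            for _ in range(trials)
        ]
        results[d] = sum(scores) / trials
    return results


if __name__ == "__main__":
    vectors = make_vectors(count=40, dim=300, rng=random.Random(42))
    for d, score in distortion_by_dimension(vectors, dims=(2, 5, 10, 50)).items():
        print(f"{d:>3}D: mean relative distance error = {score:.3f}")
```

To try this on real data, replace `make_vectors` with the derived 300D keyword vectors. Keep in mind that t-SNE preserves local neighborhoods rather than global distances, so a neighbor-overlap measure may be a fairer comparison for it than this distance-based one.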
Definitely!
I suspect at some point 3D HDBSCAN is going to be awesome to set up (probably when we're hooking up an internal dashboard for overlap.ai) and around that time I'll do a check on all this and report back.