Implement self-organizing maps (SOMs) in `linfa-clustering`
If possible, I'd like to add an implementation of self-organizing maps to `linfa-clustering`. Unlike many of the clustering algorithms already implemented in linfa, SOMs preserve topological relationships, mapping n-dimensional space onto a 2D grid so that nearby inputs land on nearby grid cells. This makes SOMs particularly useful for visualizing higher-dimensional data compared to many alternatives such as k-means clustering, which simply partitions the data into clusters without necessarily maintaining spatial relationships or "distance" between points.
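To make the proposal concrete, here is a rough sketch of the classic online (Kohonen) training step. Every name below is hypothetical, not a proposed linfa API: find the best-matching unit (BMU) for a sample, then pull every node's weight vector toward the sample with a strength that decays with grid distance from the BMU.

```rust
// Rough sketch of classic online SOM training; all names are
// hypothetical and this is not a proposed linfa API.

/// Squared Euclidean distance between two feature vectors.
fn squared_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// A grid of code (weight) vectors with `width` columns, stored row-major.
struct Som {
    width: usize,
    weights: Vec<Vec<f64>>,
}

impl Som {
    /// Index of the grid node whose weight vector is closest to `x`
    /// (the "best-matching unit", BMU).
    fn best_matching_unit(&self, x: &[f64]) -> usize {
        (0..self.weights.len())
            .min_by(|&i, &j| {
                squared_dist(&self.weights[i], x)
                    .total_cmp(&squared_dist(&self.weights[j], x))
            })
            .unwrap()
    }

    /// One online training step: pull every node toward the sample `x`,
    /// weighted by a Gaussian neighborhood around the BMU on the grid.
    /// `learning_rate` and `radius` are typically decayed over training.
    fn update(&mut self, x: &[f64], learning_rate: f64, radius: f64) {
        let bmu = self.best_matching_unit(x);
        let (bmu_row, bmu_col) = (bmu / self.width, bmu % self.width);
        for i in 0..self.weights.len() {
            let (row, col) = (i / self.width, i % self.width);
            let grid_d2 = (row as f64 - bmu_row as f64).powi(2)
                + (col as f64 - bmu_col as f64).powi(2);
            // The neighborhood falls off with distance *on the grid*,
            // not in feature space: this is what preserves topology.
            let h = (-grid_d2 / (2.0 * radius * radius)).exp();
            for (w, &xi) in self.weights[i].iter_mut().zip(x) {
                *w += learning_rate * h * (xi - *w);
            }
        }
    }
}
```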
It might also be worth looking at generative topographic mapping (GTM), which uses a specialized form of the Gaussian mixture model and the EM algorithm to achieve the same objective:
https://proceedings.neurips.cc/paper_files/paper/1996/file/4e4e53aa080247bc31d0eb4e7aeb07a0-Paper.pdf
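For reference (my paraphrase, not the paper's exact notation): GTM models the data as a constrained Gaussian mixture whose centers are the images of a fixed grid of latent points $\mathbf{x}_k$ under a parametric non-linear map $\mathbf{y}(\cdot;\mathbf{W})$:

```math
p(\mathbf{t} \mid \mathbf{W}, \beta) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}\left(\mathbf{t} \mid \mathbf{y}(\mathbf{x}_k; \mathbf{W}),\ \beta^{-1}\mathbf{I}\right)
```

$\mathbf{W}$ and $\beta$ are fitted with EM, and the posterior responsibilities over the grid give a probabilistic projection onto the 2D latent space.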
Yes @michielderoo, makes sense. Thanks for the suggestion :) I'll probably start working on the PR in a few days, if you want to join me.
To @michielderoo and the maintainers: I have been busy with life, work, school, and other open-source stuff, but I've just remembered opening this issue and plan to start a PR soon, if the maintainers think SOMs would be a good addition to linfa. 🙂
I've just taken a look at scikit-learn to see what they have done. SOMs failed to make it in ten years ago. More recently, this discussion seems to suggest that the current direction for performant SOMs is deep learning and GPUs, which are also out of scope for linfa.
At the same time, I do not mean to dismiss your proposal (we are not compelled to do as scikit-learn does, after all). I am not familiar with SOMs, and having an implementation in Rust could be valuable. It would be interesting to benchmark the method against k-means, for instance, and/or highlight some benefits on typical use cases.
@relf Thanks for the response! What would you like to do next, then? As stated in my original post:

> SOMs preserve topological relationships, mapping n-dimensional space onto a 2D grid so that nearby inputs land on nearby grid cells. This makes SOMs particularly useful for visualizing higher-dimensional data compared to many alternatives such as k-means clustering, which simply partitions the data into clusters without necessarily maintaining spatial relationships or "distance" between points.
I do concur that SOMs are perhaps less performant than other clustering methods, but they also produce qualitatively different results. What, precisely, would you like me to research/investigate next before potentially proceeding with a PR?
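One concrete possibility for the benchmark: report the SOM's quantization error (mean distance from each sample to its best-matching unit) alongside k-means inertia on the same datasets. A hypothetical helper, reusing the `Som` sketch from my original post:

```rust
// Hypothetical helper reusing the `Som` sketch above. Quantization
// error (mean sample-to-BMU distance) is the SOM analogue of k-means
// inertia, so the two can be reported side by side.
fn quantization_error(som: &Som, data: &[Vec<f64>]) -> f64 {
    let total: f64 = data
        .iter()
        .map(|x| squared_dist(&som.weights[som.best_matching_unit(x)], x).sqrt())
        .sum();
    total / data.len() as f64
}
```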
I don't think you can compare SOM to k-means, because it serves more of a dimensionality-reduction purpose. Compared to PCA, it projects data onto a non-linear surface rather than the linear subspace PCA uses.
I understand you're not familiar with the SOM; that's understandable. But it is widely used: Kohonen compiled a list of over 7,000 scientific articles using it from 1981 to 2005. You can find the addendum here:
https://users.ics.aalto.fi/tho/online-papers/TKK-ICS-R23.pdf
It doesn't need a GPU or deep learning; it has been done since the early 1980s.
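To make that concrete, reusing the hypothetical `Som` sketch from the issue description: once trained, "reducing" a sample is just reading off the grid coordinates of its best-matching unit, so nearby inputs land on nearby map cells.

```rust
// Continuing the hypothetical `Som` sketch from above: the
// dimensionality-reduction view of a trained SOM is simply mapping
// each sample to the 2D grid coordinates of its best-matching unit.
impl Som {
    /// Project a sample onto the map as the (row, col) of its BMU.
    fn project(&self, x: &[f64]) -> (usize, usize) {
        let bmu = self.best_matching_unit(x);
        (bmu / self.width, bmu % self.width)
    }
}
```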
Indeed, @michielderoo, I concur. Your thoughts on what specifically to investigate before deciding whether to proceed, @relf?
> What, precisely, would you like me to research/investigate next before potentially proceeding with a PR?

> It would be interesting to benchmark the method against k-means, for instance, and/or highlight some benefits on typical use cases.
@michielderoo thanks for the reference (2005 is pretty old, though; I suppose it is still widely used 😉). So I understand SOM should not be compared to k-means or any other clustering method. Does this mean that SOM should be in `linfa-reduction` with PCA?
@Luis-Varona my point was to identify the benefits of such a method compared to the algorithms already in linfa. I would like to see relevant documentation/examples to back up its integration into linfa. Actually, the fact that it is not included in scikit-learn would argue for starting a separate project based on the linfa framework and delaying the integration until the dust settles.
@relf Thanks again for the timely response. Are you suggesting that I make a separate crate using the linfa traits, publish it to crates.io, and then merge it into linfa later, after we discuss a bit more? You don't want me to make a PR directly to `linfa-clustering`? (In my opinion, SOMs fit clustering better than reduction, although perhaps @michielderoo would know better.)
It is definitely not wrong to say SOM is clustering, but to me the dimensionality-reduction properties are what set it apart.
I think a separate crate for SOM would be good. There are many different kinds of SOM, so the scope is broad enough.
I would love to help, but I won't be available until September.
Alright @michielderoo, so you and @relf both recommend that I make a separate crate for SOMs using the linfa traits (maybe `linfa-kohonen` or `linfa-soms`? I prefer the former; what about you?), and then we can potentially merge it in later?
`linfa-kohonen` sounds good to me.
@relf Would it be an idea to start a `linfa-contrib` organization where we can put the algorithms that don't fit in the core of linfa? Scikit-learn has the scikit-learn-contrib organization for the same reason. That way we can keep things bundled.
I would say let's start with `linfa-kohonen`. At the moment, the `linfa-` prefix seems enough to show that a project is linfa-friendly. It will always be possible to move repos under a common organization (linfa-contrib, or even rust-ml?) if need be.
@relf @michielderoo, sounds good. Will start sometime in the next few days and let you know if I need any more guidance regarding this sort of "associated crate but not a sub-crate" structure. 🙂 Thanks for all your feedback!