somoclu
Quantization and topographic error calculation
Hi,
I am using SOMOCLU in my work and it is amazing! Thanks a lot for the hard work.
There is something I feel is missing, though: quantization and topographic error calculation. For the quantization error I have everything I need in the Python wrapper, but for the topographic error I would need the second BMUs (i.e. the units with the minimum distance excluding the BMU itself). With those I could check in how many cases the second BMU is not a direct neighbor of the first (i.e. the topographic error).
Any clue on how to get them?
Thanks a lot I.
Thanks. I swear this question came up before, but I can't find it. You can use `get_surface_state` to get the activation map for a given set of data instances (or for the training set if you don't pass any parameter). The argmin along axis 1 is the BMU (see the code in `get_bmus` for extracting the X, Y coordinates). If you change the argmin in `get_bmus` to pick the second-lowest value instead, you get what you need.
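For illustration, here is a minimal numpy sketch of that idea. The mock activation map below is a stand-in; in somoclu it would come from `som.get_surface_state()`:

```python
import numpy as np

# Sketch: the activation map has one row per data instance and one
# column per map node; argsort along axis 1 orders the node indices
# by distance, so column 0 is the BMU and column 1 the second BMU.
def first_and_second_bmus(surface_state):
    order = np.argsort(surface_state, axis=1)
    return order[:, 0], order[:, 1]

# Mock activation map: 2 instances x 4 nodes
state = np.array([[0.3, 0.1, 0.7, 0.2],
                  [0.5, 0.9, 0.2, 0.4]])
first, second = first_and_second_bmus(state)
print(first, second)  # [1 2] [3 3]
```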
Thanks a lot! Your help is much appreciated. I programmed some functions to calculate the topographic error on a hex lattice, and it is giving me unusually high topographic errors (greater than 30%).
Here is a reproducible example so that you can check. I revised it and I think it has no bugs, but if you spot any error, please let me know.
```python
import somoclu
import numpy as np
from sklearn.datasets import fetch_california_housing


def get_coordinates_from_index(n, x, y):
    if n < 0:
        return (-1, -1)
    return (n % y, n // y)


def get_index_from_coordinates(c, x, y):
    if c[0] < 0 or c[1] < 0 or c[0] > y or c[1] > x:
        return -1
    return c[1] * y + c[0]


def get_neighbors_from_index(n, x, y):
    c = get_coordinates_from_index(n, x, y)
    # Horizontal offset of the diagonal neighbors flips with row parity
    offset_x = -1 if c[1] % 2 == 0 else 1
    neighbors_c = [(c[0] - 1, c[1]), (c[0] + 1, c[1]),
                   (c[0], c[1] - 1), (c[0] + offset_x, c[1] - 1),
                   (c[0], c[1] + 1), (c[0] + offset_x, c[1] + 1)]
    neighbors_c = [c for c in neighbors_c
                   if c[0] >= 0 and c[1] >= 0 and c[1] < x and c[0] < y]
    neighbors = [get_index_from_coordinates(c, x, y) for c in neighbors_c]
    return list(set(neighbors))


def calculate_topographical_error(som, n_rows, n_columns):
    # Column 0: BMU index; column 1: second BMU index
    bmus_1st_and_2nd = np.argsort(som.get_surface_state(), axis=1)[:, :2]
    neighbors = [get_neighbors_from_index(t, n_rows, n_columns)
                 for t in bmus_1st_and_2nd[:, 0]]
    e_t = 1 - np.mean([second in neighs for (second, neighs)
                       in zip(bmus_1st_and_2nd[:, 1], neighbors)])
    return e_t


data = fetch_california_housing()["data"]
n_rows, n_columns = 15, 7
som = somoclu.Somoclu(n_columns, n_rows, gridtype="hexagonal")
som.train(data, epochs=1000)
e_t = calculate_topographical_error(som, n_rows, n_columns)
print("The topographical error obtained is: %s" % e_t)
```
If this is not an error in my code, I think it is a critical issue, because it could mean that the map is not preserving the topology of the data.
Thank you in advance!
Not sure about this. The calculation is definitely not right for the toroid surface, which should have a higher quality embedding.
I also tried PCA initialization. That significantly improves the topographical error.
Yes, I tried changing various parameters... The lowest error I achieved was 25%, which is still very high compared to the SOM Toolbox in Matlab or SOMPY; so strange. Do you have any clue about what to revise or where the issue could be? I'll keep looking.
This is really weird. Normally I set the epochs much lower (around 10), and the map larger, but it further increases the topographic error. I am really puzzled by this. I wonder if this could be an artefact of batch training. Does batch training in SOM Toolbox and SOMPY give a low error?
When I trained Self Organising Maps with SOM toolbox, I always used batching, never tried without batch method. And SOMPY only implements batch method; I don't really think it is the cause of the issue.
There is something I assumed that may not be correct... How are the hexagons in the hexagonal grid oriented with respect to the matrix? I assumed they are arranged like in the image I attach below. This is crucially important because I could be computing the neighbors wrong...
Thanks!
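For what it's worth, the offset scheme assumed here can be sketched in plain Python. This is purely illustrative; whether it actually matches somoclu's hexagonal layout is exactly the open question:

```python
# Illustrative sketch of the assumed hex offset scheme: cells are
# addressed as (col, row), every second row is shifted half a cell,
# so the horizontal offset of the two diagonal neighbors flips with
# the row parity.
def hex_neighbors(col, row):
    offset = -1 if row % 2 == 0 else 1
    return [(col - 1, row), (col + 1, row),           # same row
            (col, row - 1), (col + offset, row - 1),  # row above
            (col, row + 1), (col + offset, row + 1)]  # row below

print(hex_neighbors(2, 2))  # even row: diagonal neighbors shift left
print(hex_neighbors(2, 3))  # odd row: diagonal neighbors shift right
```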
That is the lattice. I tried to reduce the problem to a toy problem on a rectangular lattice. What I noticed is that there is strong competition to be the second BMU: there are many nodes with hardly differing distances. So instead of a single second BMU, I picked a set of candidates within a threshold of 10e-5, and that dramatically improved the topographic error (~0-10% in the best cases). I also noticed that the best values correspond to cases where the points map roughly uniformly onto the SOM. Maybe you can extend this to the hexagonal case, although I have never seen any compelling reason to use a hexagonal grid. The code is super-sloppy, sorry about that.
```python
import somoclu
import numpy as np


def get_coordinates_from_index(n, x, y):
    if n < 0:
        return (-1, -1)
    return (n % y, n // y)


def get_index_from_coordinates(c, x, y):
    if c[0] < 0 or c[1] < 0 or c[0] > y or c[1] > x:
        return -1
    return c[1] * y + c[0]


def get_neighbors_from_index(n, x, y):
    c = get_coordinates_from_index(n, x, y)
    neighbors_c = [(c[0] - 1, c[1]), (c[0] + 1, c[1]),
                   (c[0], c[1] - 1), (c[0], c[1] + 1)]
    neighbors_c = [c for c in neighbors_c
                   if c[0] >= 0 and c[1] >= 0 and c[1] < x and c[0] < y]
    neighbors = [get_index_from_coordinates(c, x, y) for c in neighbors_c]
    return list(set(neighbors))


def calculate_topographical_error(som, n_rows, n_columns):
    surface_state = som.get_surface_state()
    bmus_1st_and_2nd = np.argsort(surface_state, axis=1)[:, :2]
    # Collect all near-ties for the second BMU within a small threshold
    all_2nd = []
    for i, second_index in enumerate(bmus_1st_and_2nd[:, 1]):
        all_2nd.append([])
        distance = surface_state[i, second_index]
        for s_i, s in enumerate(surface_state[i]):
            if abs(s - distance) < 10e-5:
                all_2nd[-1].append(s_i)
    neighbors = [get_neighbors_from_index(t, n_rows, n_columns)
                 for t in bmus_1st_and_2nd[:, 0]]
    # An instance counts as correct if any near-tied second BMU
    # is a direct neighbor of the BMU
    e_t = 1 - np.mean([len(set(second) & set(neighs)) > 0
                       for (second, neighs) in zip(all_2nd, neighbors)])
    return e_t


# Toy data: two well-separated clusters of 5 points each
c1 = np.random.rand(5, 3) / 5
c2 = (0.6, 0.1, 0.05) + np.random.rand(5, 3) / 5
data = np.float32(np.concatenate((c1, c2)))
colors = ["red"] * 5 + ["green"] * 5
n_rows, n_columns = 5, 10
som = somoclu.Somoclu(n_columns, n_rows)
som.train(data, epochs=10)
e_t = calculate_topographical_error(som, n_rows, n_columns)
print("The topographical error obtained is: %s" % e_t)
som.view_umatrix(bestmatches=True, bestmatchcolors=colors)
```
Ok, thank you very much! I will take a look. The reason I use a hex lattice is that every hexagon has more neighbors, and the neighbors of a hexagon are all equidistant from it; that yields a more natural mapping.
Again, thank you for the library. I am going to open another issue because I have developed code for plotting the hexagonal grid in a very fancy way, and I would be keen to contribute it if you are interested.
Well, I have just been playing with my toy problem. Once the SOM has been trained, I define an all-zeros component plane and plot where the first and second BMUs lie. I have to say that even though the algorithm seems to be working (and certainly not doing things at random), the first and second BMUs sometimes lie considerably far apart. The topographic error I reported is correctly calculated. It seems that something strange happens with the hex lattice... I will continue checking.
Thanks. I welcome all contributions, of course. I know that in theory hexagonal grids are more advantageous, but in practice I've never seen any tangible difference. So the hexagonal grid has not received much attention in the implementation, and we cannot exclude the possibility that there is a bug. I quickly went through the relevant low-level C++ code and can't spot anything obvious (the relevant codebook update is in `denseCPUKernels.cpp`, starting on l. 157).
@ivallesp: did you try playing with the `std_coeff` parameter? Somoclu uses 0.5 by default, but SOM Toolbox uses 1.
Also, could you state more precisely what results you get with each version? It would be useful to list all the relevant elements for a good comparison: the dataset used, the SOM dimensions and topology, and the training parameters.
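To get a feel for what `std_coeff` changes, here is a small numpy sketch of a Gaussian neighborhood weight of the form `exp(-d^2 / (2 * (std_coeff * radius)^2))`. Treat the exact formula as an assumption and check it against the somoclu source:

```python
import numpy as np

# Neighborhood weight at grid distance d for a given radius.
# Assumed form: exp(-d^2 / (2 * (std_coeff * radius)^2)).
def gaussian_weight(d, radius, std_coeff):
    return np.exp(-d ** 2 / (2.0 * (std_coeff * radius) ** 2))

# At d == radius, std_coeff = 0.5 gives exp(-2) ~ 0.135, while
# std_coeff = 1.0 gives exp(-0.5) ~ 0.607: a much wider neighborhood.
for std_coeff in (0.5, 1.0):
    print(std_coeff, gaussian_weight(4.0, 4.0, std_coeff))
```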
I tried your code with a rectangular topology and 8 neighbors, and things get worse, which is somewhat strange.
Hi!
I came back because I have a use case for the SOM and I remembered this issue. I just tried again and it seems to behave the same way. This is actually my only reason for not using this library, so do you have any clue why it is not achieving decent topographic errors? This is crucial for the cooperation phase of the algorithm to perform well.
Thanks!
@ivallesp In my latest experiments with my sparse-som tool, which seems to suffer from the same problem, I noticed that the topographic error decreases nicely during the first iterations, but abruptly increases at the end of training.
I didn't investigate further, but the topographic error behavior seems related to the nature of the data and also to the `std_coeff` (which controls the final "differentiation" between the nodes at the end of training). Maybe this can help you.
Thanks so much @yoch. I increased `std_coeff` to 1 as you suggested, but it seems to achieve the same result. As for the data, I tested with the same data using the SOMPY package, and there I get a good result (topographic error < 1%, while with this library it is about 30%). Just check it out! :D
https://github.com/sevamoo/SOMPY/blob/master/sompy/examples/California%20Housing.ipynb
The problem is that this library, being written mainly in C++, is very difficult for me to debug... Any help will be much appreciated.
Note that the code you provided uses variance normalization, which improves the results greatly. But the problem persists for me (I guess it's the same with somoclu, but I haven't tested).
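(For the record, "variance normalization" here just means standardizing each feature to zero mean and unit variance, as `sklearn.preprocessing.scale` does; a plain-numpy equivalent:)

```python
import numpy as np

# Standardize each column (feature) to zero mean and unit variance,
# equivalent to sklearn.preprocessing.scale for dense data.
def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = standardize(X)
print(Z.mean(axis=0))  # ~[0. 0.]
print(Z.std(axis=0))   # [1. 1.]
```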
I tried with this code:

```python
import scipy.sparse as sp
from sklearn.preprocessing import scale
from sklearn.datasets import fetch_california_housing
from sparse_som import *

data = fetch_california_housing()
data = scale(data.data)     # IMPORTANT: variance normalization
data = sp.csr_matrix(data)  # because I use sparse_som
m, n = data.shape
som = BSom(27, 27, n)
som.verbose = 2
som.train(data, std=0.5)
first, second, _ = som._bmus_and_seconds(data)
print('TE:', som._topographic_error(first, second, m))
```
and got these results (TE is the topographic error):

```
Start learning...
- TE: 0.981686
epoch 1 / 10, r = 13 - TE: 0.157752
epoch 2 / 10, r = 11.6111 - TE: 0.0291667
epoch 3 / 10, r = 10.2222 - TE: 0.0172965
epoch 4 / 10, r = 8.83333 - TE: 0.0162306
epoch 5 / 10, r = 7.44444 - TE: 0.0123547
epoch 6 / 10, r = 6.05556 - TE: 0.0148256
epoch 7 / 10, r = 4.66667 - TE: 0.0170543
epoch 8 / 10, r = 3.27778 - TE: 0.0213663
epoch 9 / 10, r = 1.88889 - TE: 0.0420543
epoch 10 / 10, r = 0.5
Finished: elapsed 2.49944s
TE: 0.1626453488372093
```
Hi @ivallesp ,
I know this subject is from a few years ago, but I am currently working with self-organizing maps for my thesis (thanks Peter for the somoclu package, it really is wonderful).
However, I can't figure out how to compute the quantization error for a rectangular grid, and you said you managed to compute it from the Python wrapper, is that right? Could you show me how you did it, please?
It would be much appreciated.
Thanks for your help.
Hi @maevaedcoding! I always love to help with SOM topics.
I ended up writing my own SOM library: https://github.com/ivallesp/somnium/tree/master/somnium
It supports hexagonal and rectangular grids, and it has topographic and quantization errors implemented. See the example here: https://github.com/ivallesp/somnium/blob/master/examples/getting_started.ipynb
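A minimal sketch of the quantization error from the activation map, assuming (as discussed earlier in this thread) that `som.get_surface_state()` returns one row of node distances per sample; the QE is then the mean distance to each sample's BMU, i.e. the mean of the row-wise minima:

```python
import numpy as np

# Quantization error: mean distance between each sample and its BMU.
# surface_state is assumed to be an (n_samples, n_nodes) distance
# matrix, as returned by som.get_surface_state() per this thread.
def quantization_error(surface_state):
    return float(np.min(surface_state, axis=1).mean())

# Mock activation map: 3 samples x 4 nodes
state = np.array([[0.2, 0.5, 0.9, 0.4],
                  [0.7, 0.1, 0.3, 0.8],
                  [0.6, 0.6, 0.2, 0.5]])
print(quantization_error(state))  # mean of [0.2, 0.1, 0.2]
```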