The neighborhood graph is not connected
Hi there, I am currently using the tapkee executable as a pipeline in my project in python, I am creating a text file with all my values to reduce wich looks like this :
0.301961,0.376471,0.443137,0,0.992157,0 0.188235,0.235294,0.298039,0.847059,0.752941,0 0.192157,0.239216,0.294118,0.854902,0.72549,0 0.266667,0.282353,0.380392,0.901961,0.505882,0.8 0.215686,0.211765,0.215686,0.913725,0.498039,0.8 0.25098,0.262745,0.286275,0.917647,0.494118,0.8 0.309804,0.305882,0.333333,0.905882,0.419608,0.8 0.258824,0.321569,0.396078,0.831373,0.811765,0 0.239216,0.298039,0.376471,0.831373,0.807843,0 0.231373,0.290196,0.368627,0.819608,0.741176,0 0.290196,0.360784,0.443137,0.784314,0.780392,0 0.301961,0.376471,0.443137,0,0.992157,0 0.188235,0.235294,0.298039,0.847059,0.752941,0 0.192157,0.239216,0.294118,0.854902,0.72549,0 0.266667,0.282353,0.380392,0.901961,0.505882,0.8 0.215686,0.211765,0.215686,0.913725,0.498039,0.8 0.25098,0.262745,0.286275,0.917647,0.494118,0.8 0.309804,0.305882,0.333333,0.905882,0.419608,0.8 0.258824,0.321569,0.396078,0.831373,0.811765,0
This example works fine but if I add one more line like
0.1,0.2,0.3,0.4,0.5,0
I get the message "[warning] The neighborhood graph is not connected"

So I check for solution and the only help I could find is this message that you wrote in your documentation.
Please note that "[warning] The neighborhood graph is not connected" message in most cases means that ’tapkee’ run was unsuccessful. As a result, Tapkee() might return the matrix of NaN’s. One of possible workarounds is to specify the higher number of neigbors (’-k’ option, default is 10). See below for the example.
Then i tried to change the k values and it works when it's equal or superior to the number of rows divised by 2. So if I have 30 rows I have to make the neighbour values to 15 or higher.
Is it right or something else is wrong ? I am supposed to use this method with a lot of rows (4000 to million), it gonna be long no ? I may not understand perfectly how it works.
Thanks for your help.
Hey! Sorry it took a while to respond you.
The problem is that the neighborhood strategy being used in Tapkee is dead simple: it finds the nearest neighbors for each point and then uses these connections exclusively. This means that if you add a point that is not a neighbor for any other point it gets 'dropped' out of graph (never reachable).
To solve the problem it might make sense to allow increasing the number of connections if some points lack them to be connected. Other mitigation might be to notify what are the disconnected points so that you can drop them from the dataset. I am not aware what mitigation might be the best, though.
Hi,
Only now I have gone through this. Just in case, excuses for the years-long lead time.
I am wondering what is the meaning of that a point is not a neighbor of any other point? Even if a point is very, very far away from all the others, at least one of the others must be the closest to it. Unless there's some sort of maximum distance between neighbors, I am guessing.
I will start running the investigation together with one for ** On entry to DLASCL parameter number 4 had an illegal value I've been encountering and keep you posted.
I am wondering what is the meaning of that a point is not a neighbor of any other point?
All right, I think this little inquietude doesn't really matter: even if every point has at least one neighbor the overall neighborhood graph could of course have more than one connected component
It is just a matter of a form of the distance matrix. Once it becomes a block matrix the whole thing stops working because eigendecomposition is not unique, I think.
Basically, every connected component should be embedded separately. But it doesn't make a lot of sense since we would not have any idea how they are positioned against each other. So that's why I think it is better to either:
- Drop the smallest components
- Try to fix the graph so that it is connected
- Drop some outliers, maybe
- Throw an exception
Basically, every connected component should be embedded separately. But it doesn't make a lot of sense since we would not have any idea how they are positioned against each other.
Ah, that's a good point, I didn't think about the separate embeddings.
So that's why I think it is better to either:
- Drop the smallest components
- Try to fix the graph so that it is connected
- Drop some outliers, maybe
- Throw an exception
This is what I thought earlier, it would the in the second point:
The issue is that given a number of neighbors (k) k_0 by the user (or the default one, if unprovided) the neighbor graph can be, let's say, unconnected and it is expected the result won't be trustworthy then. If we would use a k equal to the number of points (or minus 1), then it would be surely connected, right? From there, I think something that can be automated (perhaps as an optional routine enabled by default) is to rerun the neighbors binary-searching between k_0 and the number of points until one that produces a connected graph is found. It could of course be refined looking for the minimum k that makes it connected.
I opened #83 with a simple fix to recompute the neighbors if the neighborhood graph is not connected.
By the way, I didn't manage to reproduce with the data in the opening comment. The number of vectors I got was either 19 or 20 (with the extra one), different to the 21 in the screenshot. I used the swissroll5000 from the jmlr repo to test this. In that one the binary search went from the initial 10 to 2505, which worked well for the graph, but then stopped quickly due to out of memory usage in the eigendecomposition. Then, I updated it to just multiplying by 2.