update function of UMAP does not work
I'm trying to build an incremental trainer for UMAP, updating on batches of data. I'm testing this out with MNIST.
import numpy as np
import sklearn.datasets
import umap
import umap.utils as utils
import umap.aligned_umap
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(int)
first, second = mnist.data[:50000], mnist.data[50000:]
print(first.shape, second.shape)
standard_embedding = umap.UMAP(random_state=42).fit(first)
standard_embedding.update(second)
On calling update I see:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/3d/d0dl2ykn6c18qg7kg_j7tplm0000gn/T/ipykernel_98177/3602609767.py in <module>
----> 1 standard_embedding.update(second)
~/.pyenv/versions/3.9.6/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/umap/umap_.py in update(self, X)
3129
3130 else:
-> 3131 self._knn_search_index.update(X)
3132 self._raw_data = self._knn_search_index._raw_data
3133 (
~/.pyenv/versions/3.9.6/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pynndescent/pynndescent_.py in update(self, X)
1611 X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C")
1612
-> 1613 original_order = np.argsort(self._vertex_order)
1614
1615 if self._is_sparse:
AttributeError: 'NNDescent' object has no attribute '_vertex_order'
Is this expected behavior? Am I using UMAP improperly here? I see an example of aligned_umap, but I was hoping to use standard UMAP since I do not have relations between the batches.
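(For reference, this is roughly what the aligned_umap route would require as I read the AlignedUMAP docs; the relations dicts linking rows of one batch to rows of the next are exactly what I don't have, and the values below are placeholders:)
# Sketch only, based on my reading of the AlignedUMAP docs, not something I can actually run:
# each entry of `relations` maps row indices of batch i to row indices of batch i+1;
# the {0: 0, 1: 1} dict here is a made-up placeholder.
batches = [first, second]
relations = [{0: 0, 1: 1}]
aligned = umap.aligned_umap.AlignedUMAP().fit(batches, relations=relations)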
It is certainly not expected behaviour. I'll have to look into this a little and see if I can reproduce it to figure out why this is going astray (it certainly worked at one time).
I actually ran into this problem yesterday and have a fix ready to go @lmcinnes, will open a PR. It's only an issue for the n>4096 path in update.
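For anyone who wants to check which path they are on before the fix lands, here is a rough probe (it leans on an internal attribute, so treat the expected values as assumptions about the implementation rather than documented behaviour):
# Assumption: UMAP flags datasets of fewer than roughly 4096 samples as "small data"
# and uses brute-force kNN there; the failing branch of update() is the one that
# goes through the pynndescent search index instead.
small_mapper = umap.UMAP(random_state=42).fit(first[:4000])
large_mapper = umap.UMAP(random_state=42).fit(first[:10000])
print(getattr(small_mapper, "_small_data", None))   # expected True  (assumption)
print(getattr(large_mapper, "_small_data", None))   # expected False (assumption)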
Hi, I am also trying to build an incremental UMAP trainer, loading my embedding vectors in batches of 9k. But for the following code:
cnt = 0
start_index = 0
end_index = 100
search_space = {}
reducer = None

# Loading all search space pickles one by one
for i in tqdm(range(start_index, end_index), leave=True, position=0):
    searchSpacePath = emb_vectors_path + "Split_" + str(i) + const_string
    embDict = {}
    with open(searchSpacePath, 'rb') as handle:
        embDict = pickle.load(handle)
    search_space.update(embDict)
    data_to_df = convert_to_df(list(embDict.values()))

    # Updating the low-dimensional graph with new points:
    if i == 0:
        # first set of vectors loaded & fit
        reducer = umap.UMAP(n_neighbors=10,
                            min_dist=0.0,
                            n_components=20,
                            metric="euclidean",
                            random_state=random_state).fit(data_to_df)
    else:
        reducer.update(data_to_df)

    print("\nLoaded %d items" % (len(search_space)))
I am facing this error:
ZeroDivisionError Traceback (most recent call last)
<ipython-input-30-f5089cfcbf5e> in <module>()
26 random_state = random_state).fit(data_to_df)
27 else:
---> 28 reducer.update(data_to_df)
29
30 print("\nLoaded %d items"%(len(search_space)))
/usr/local/lib/python3.7/dist-packages/umap/umap_.py in update(self, X)
3346 )
3347 init[:original_size] = self.embedding_
-> 3348 init_update(init, original_size, self._knn_indices)
3349
3350 if self.n_epochs is None:
ZeroDivisionError: division by zero
My data is basically L2-normalised vectors of 1024 dimensions, and I am using update to incrementally add 9 lakh (900,000) data points to the UMAP mapper.
I tried the MNIST example and it worked properly with .update(). The only difference I could find between my data and the MNIST one is that mine is L2-normalised; is that causing the zero-division error?
Also, I tried to load 3 lakh (300,000) vectors and apply UMAP on them directly via the fit function. The results were amazing, but for any data beyond 3 lakh my Colab instance kept crashing. Is there any other way to handle such a large dataset for dimensionality reduction?
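One workaround I am considering (only a sketch, assuming a representative subsample fits in Colab memory and that transform() is accurate enough for the remaining points; sample_vectors and vector_batches are placeholder names for my own data loading):
import numpy as np
import umap

# Sketch: fit on a subsample that fits in memory, then project the remaining
# batches with transform() instead of update(). sample_vectors and
# vector_batches are hypothetical stand-ins for the real data.
reducer = umap.UMAP(n_neighbors=10, min_dist=0.0, n_components=20,
                    metric="euclidean", random_state=42).fit(sample_vectors)

embedded_parts = [reducer.embedding_]
for batch in vector_batches:                 # the vectors not used in the fit
    embedded_parts.append(reducer.transform(batch))

full_embedding = np.vstack(embedded_parts)   # coordinates for every vector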
@ThomasNickerson has that merged PR been released in a new version?
@ThomasNickerson Really cool to see this new PR! I was playing around a bit with this new functionality and seem to have uncovered a bug, I think related to @vedrocks15's post. When trying out different dataset splits with the update function I noticed I was getting the division-by-zero error in the init_update function. After a bit more digging, I believe the issue happens when you call update with a dataset that is larger than the one UMAP has already been fit on.
As a reference here is the code that I am using:
import numpy as np
import sklearn.datasets
import umap
import umap.plot
import umap.utils as utils
import umap.aligned_umap
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(int)
mnist_data = mnist.data.values
mnist_labels = mnist.target.values
split = 20000
first, second = mnist_data[:split], mnist_data[split:]
standard_embedding = umap.UMAP(random_state=42, verbose=True).fit(first)
standard_embedding.update(second)
Since the MNIST dataset has 70_000 samples, based on some spot testing, setting split < 35_000 (i.e. updating with more points than were originally fit) produces the divide-by-zero error described by @vedrocks15.
Additionally, here is the code for the init_update function, where I believe the error comes from the line current_init[i, d] /= n:
@numba.njit()
def init_update(current_init, n_original_samples, indices):
    for i in range(n_original_samples, indices.shape[0]):
        n = 0
        for j in range(indices.shape[1]):
            for d in range(current_init.shape[1]):
                if indices[i, j] < n_original_samples:
                    n += 1
                    current_init[i, d] += current_init[indices[i, j], d]
        for d in range(current_init.shape[1]):
            current_init[i, d] /= n
    return
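Just to illustrate where n can end up as zero, here is a minimal guarded variant (a sketch only, not the library's actual fix; it simply skips the division when a new point has no neighbours among the original samples, leaving that row of the init untouched):
import numba

@numba.njit()
def init_update_guarded(current_init, n_original_samples, indices):
    # Identical to init_update above, except the division is skipped when a new
    # point has no neighbours among the original samples (the case where n
    # stays 0 and raises ZeroDivisionError).
    for i in range(n_original_samples, indices.shape[0]):
        n = 0
        for j in range(indices.shape[1]):
            for d in range(current_init.shape[1]):
                if indices[i, j] < n_original_samples:
                    n += 1
                    current_init[i, d] += current_init[indices[i, j], d]
        if n > 0:
            for d in range(current_init.shape[1]):
                current_init[i, d] /= n
    return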
@jonathangomesselman I saw this code too, and clearly the "n" variable can stay at zero in the loop (when none of a new point's neighbours fall among the original samples), which causes the division-by-zero error. I would just like to add that I also tried to .fit() the UMAP object with 3 lakh (300,000) vectors (it worked), followed by .update() with only 9k vectors, and hit the same division-by-zero problem again.
Hey @lmcinnes & @vedrocks15, I am working on something similar and got the same divide-by-zero error.
My dataset has more than 4M rows and 384 dimensions. While trying to reduce the dimensionality to 50, my 32 GB RAM system cannot hold all 4M rows at once, so I had to go with batch processing. I am trying to fit small chunks of data to UMAP, and in the process, update doesn't seem to help much.
The first small chunk:
xvs[:10000].shape => (10000, 384)
model1 = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=50,
    random_state=42,
).fit(xvs[:10000])
model1.embedding_.shape => (10000, 50)
model1.update(xvs[10000:20000]) gives the following error:
ZeroDivisionError Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 model1.update(xvs[10000:20000])
File ~\anaconda3\envs\py39\lib\site-packages\umap\umap_.py:3348, in UMAP.update(self, X)
3344 init = np.zeros(
3345 (self._raw_data.shape[0], self.n_components), dtype=np.float32
3346 )
3347 init[:original_size] = self.embedding_
-> 3348 init_update(init, original_size, self._knn_indices)
3350 if self.n_epochs is None:
3351 n_epochs = 0
ZeroDivisionError: division by zero
But when I re-run the same update call, I get a different error this time:
ValueError Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 model1.update(xvs[10000:20000])
File ~\anaconda3\envs\py39\lib\site-packages\umap\umap_.py:3347, in UMAP.update(self, X)
3329 self.graph_, self._sigmas, self._rhos = fuzzy_simplicial_set(
3330 self._raw_data,
3331 self.n_neighbors,
(...)
3341 self.verbose,
3342 )
3344 init = np.zeros(
3345 (self._raw_data.shape[0], self.n_components), dtype=np.float32
3346 )
-> 3347 init[:original_size] = self.embedding_
3348 init_update(init, original_size, self._knn_indices)
3350 if self.n_epochs is None:
ValueError: could not broadcast input array from shape (10000,50) into shape (20000,50)
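My guess (just an assumption from reading the two tracebacks, I have not checked the source) is that the first, failed update already appended the new rows internally, leaving the mapper half-updated; a quick check should show it:
# Assumption: the failed first call already grew the stored raw data to 20000
# rows while embedding_ still has 10000 rows, hence the broadcast error on retry.
print(model1._raw_data.shape)    # expected (20000, 384) after the failed update
print(model1.embedding_.shape)   # expected (10000, 50)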
I'm not sure how to approach this problem, or whether there is a better solution for batch processing in UMAP; I just need to fit the data in chunks and get model.embedding_ at the end for the next steps.
Thank you!