
update function of UMAP does not work

Open Ben-Epstein opened this issue 4 years ago • 6 comments

I'm trying to build an incremental trainer for UMAP, updating on batches of data. I'm testing this out with MNIST.

import numpy as np
import sklearn.datasets
import umap
import umap.utils as utils
import umap.aligned_umap
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(int)

first, second = mnist.data[:50000], mnist.data[50000:]
print(first.shape, second.shape)

standard_embedding = umap.UMAP(random_state=42).fit(first)
standard_embedding.update(second)

On calling update I see:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/3d/d0dl2ykn6c18qg7kg_j7tplm0000gn/T/ipykernel_98177/3602609767.py in <module>
----> 1 standard_embedding.update(second)

~/.pyenv/versions/3.9.6/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/umap/umap_.py in update(self, X)
   3129 
   3130         else:
-> 3131             self._knn_search_index.update(X)
   3132             self._raw_data = self._knn_search_index._raw_data
   3133             (

~/.pyenv/versions/3.9.6/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pynndescent/pynndescent_.py in update(self, X)
   1611         X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C")
   1612 
-> 1613         original_order = np.argsort(self._vertex_order)
   1614 
   1615         if self._is_sparse:

AttributeError: 'NNDescent' object has no attribute '_vertex_order'

Is this expected behavior? Am I using UMAP improperly here? I see an example of aligned_umap, but I was hoping to use standard UMAP, as I do not have relations between the batches.

Ben-Epstein avatar Oct 20 '21 15:10 Ben-Epstein

It is certainly not expected behaviour. I'll have to look into this a little and see if I can reproduce it to figure out why this is going astray (it certainly worked at one time).

lmcinnes avatar Oct 21 '21 16:10 lmcinnes

I actually ran into this problem yesterday and have a fix ready to go @lmcinnes, will open a PR. It's only an issue for the n>4096 path in update.

ThomasNickerson avatar Oct 21 '21 19:10 ThomasNickerson

Hi, I am also trying to build an incremental UMAP trainer, where I load my embedding vectors in batches of 9k. But with the following code:

import pickle

import umap
from tqdm import tqdm

# emb_vectors_path, const_string, random_state and convert_to_df are
# defined elsewhere in the notebook.
start_index = 0
end_index = 100
search_space = {}
reducer = None

# Load the search-space pickles one by one
for i in tqdm(range(start_index, end_index), leave=True, position=0):
    searchSpacePath = emb_vectors_path + "Split_" + str(i) + const_string
    with open(searchSpacePath, 'rb') as handle:
        embDict = pickle.load(handle)

    search_space.update(embDict)

    data_to_df = convert_to_df(list(embDict.values()))

    # Update the low-dimensional graph with the new points:
    if i == start_index:
        # first batch of vectors: fit the initial model
        reducer = umap.UMAP(n_neighbors=10,
                            min_dist=0.0,
                            n_components=20,
                            metric="euclidean",
                            random_state=random_state).fit(data_to_df)
    else:
        reducer.update(data_to_df)

print("\nLoaded %d items" % len(search_space))

I am facing this error:

ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-30-f5089cfcbf5e> in <module>()
     26                         random_state = random_state).fit(data_to_df)
     27   else:
---> 28     reducer.update(data_to_df)
     29 
     30 print("\nLoaded %d items"%(len(search_space)))

/usr/local/lib/python3.7/dist-packages/umap/umap_.py in update(self, X)
   3346             )
   3347             init[:original_size] = self.embedding_
-> 3348             init_update(init, original_size, self._knn_indices)
   3349 
   3350             if self.n_epochs is None:

ZeroDivisionError: division by zero

My data is L2-normalised vectors of 1024 dimensions, and I am using update to incrementally add 900,000 (9 lakh) data points to the UMAP mapper.

I tried the MNIST example and it worked properly with ".update()". The only difference I could find between my data and the MNIST data is that mine is L2-normalised; could that be causing the zero-division error?

I also tried loading 300,000 (3 lakh) vectors and applying UMAP to them directly via the fit function. The results were amazing, but for any data beyond 3 lakh my Colab instance kept crashing. Is there any other way to handle such a large dataset for dimensionality reduction?
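
A minimal sketch of one possible workaround, assuming the vectors are available as a single array (the name `data` below is hypothetical): fit UMAP on a random subsample that fits in memory, then project the remaining points in batches with the standard transform method. Note that transform only maps new points into the existing embedding; unlike update, it does not change the model itself.

import numpy as np
import umap

# `data` is assumed to be an (N, 1024) array of the L2-normalised vectors.
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(data), size=300_000, replace=False)

# Fit on a subsample that fits in memory (3 lakh rows worked above).
reducer = umap.UMAP(n_neighbors=10, min_dist=0.0, n_components=20,
                    metric="euclidean").fit(data[sample_idx])

# Project the full dataset in batches of 9k.
batches = [reducer.transform(data[start:start + 9_000])
           for start in range(0, len(data), 9_000)]
embedding = np.vstack(batches)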

vedrocks15 avatar Jan 20 '22 12:01 vedrocks15

@ThomasNickerson has that merged PR been released in a new version?

Ben-Epstein avatar Jan 20 '22 13:01 Ben-Epstein

@ThomasNickerson Really cool to see this new PR! I was playing around a bit with this new functionality and seem to have uncovered a bug, I think related to @vedrocks15's post. When trying out different dataset splits with the update function, I noticed I was getting the division-by-zero error in the init_update function. After a bit more digging, I believe the issue happens when you call update with a dataset that is larger than the dataset UMAP has been fit on.

As a reference here is the code that I am using:

import numpy as np
import sklearn.datasets
import umap
import umap.plot
import umap.utils as utils
import umap.aligned_umap
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(int)
mnist_data = mnist.data.values
mnist_labels = mnist.target.values

split = 20000 
first, second = mnist_data[:split], mnist_data[split:]
standard_embedding = umap.UMAP(random_state=42, verbose=True).fit(first)
standard_embedding.update(second)

Since the MNIST dataset has 70_000 samples, based on some spot testing, setting split to < 35_000 produces the divide-by-zero error described by @vedrocks15.
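
As a hedged diagnostic of why the split matters, you can count how many of each new point's approximate nearest neighbours are original points; the division by zero fires exactly for new points where that count is zero. This pokes at the private _knn_indices attribute visible in the traceback, which has already been grown to cover the new points by the time update() raises, but it may change between umap-learn versions.

import numpy as np

# `standard_embedding` and `split` as in the snippet above, after the failed update.
knn = standard_embedding._knn_indices          # shape (n_total, n_neighbors)
n_original_neighbours = (knn[split:] < split).sum(axis=1)
print("new points with no original neighbours:",
      int((n_original_neighbours == 0).sum()))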

Additionally, here is the code for the init_update function, where I believe the error comes from the line current_init[i, d] /= n:

@numba.njit()
def init_update(current_init, n_original_samples, indices):
    for i in range(n_original_samples, indices.shape[0]):
        n = 0
        for j in range(indices.shape[1]):
            for d in range(current_init.shape[1]):
                if indices[i, j] < n_original_samples:
                    n += 1
                    current_init[i, d] += current_init[indices[i, j], d]
        for d in range(current_init.shape[1]):
            current_init[i, d] /= n

    return
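
For reference, a minimal sketch of a guarded variant (the name init_update_guarded is hypothetical; this is not the library's actual fix): increment n once per matching neighbour rather than once per dimension, and skip the division when a new point has no original neighbours, so that point simply keeps its zero initialisation.

@numba.njit()
def init_update_guarded(current_init, n_original_samples, indices):
    for i in range(n_original_samples, indices.shape[0]):
        n = 0
        for j in range(indices.shape[1]):
            if indices[i, j] < n_original_samples:
                n += 1  # count each original neighbour once, not once per dimension
                for d in range(current_init.shape[1]):
                    current_init[i, d] += current_init[indices[i, j], d]
        if n > 0:  # avoid dividing by zero when no neighbour is original
            for d in range(current_init.shape[1]):
                current_init[i, d] /= n
    return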

jonathangomesselman avatar Jan 20 '22 23:01 jonathangomesselman

@jonathangomesselman I saw this code too, and the problem is that n stays at zero whenever none of a new point's neighbours fall among the original samples, which causes the division-by-zero error. I would just like to add that I also tried to .fit() the UMAP object with 300,000 (3 lakh) vectors (it worked), followed by ".update()" with only 9k vectors, and hit the same division-by-zero error.

vedrocks15 avatar Jan 21 '22 04:01 vedrocks15

Hey @lmcinnes & @vedrocks15 I am working on something similar and got the same error of divide by zero.
My dataset has more than 4M rows and 384 dimensions. While trying to reduce the dimension to 50, my 32 Gb RAM system doesn't take all of the 4M rows at once and I had to go with Batch processing. I am trying to fit the small chunks of data to UMAP and in the process of doing that, update doesn't seem to help much.

Shape of the first small chunk: xvs[:10000].shape => (10000, 384)

model1 = umap.UMAP(
            n_neighbors=30,
            min_dist=0.0,
            n_components=50,
            random_state=42,
            ).fit(xvs[:10000])

model1.embedding_.shape => (10000, 50)

model1.update(xvs[10000:20000]) gives the following error:

ZeroDivisionError                         Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 model1.update(xvs[10000:20000])

File ~\anaconda3\envs\py39\lib\site-packages\umap\umap_.py:3348, in UMAP.update(self, X)
   3344 init = np.zeros(
   3345     (self._raw_data.shape[0], self.n_components), dtype=np.float32
   3346 )
   3347 init[:original_size] = self.embedding_
-> 3348 init_update(init, original_size, self._knn_indices)
   3350 if self.n_epochs is None:
   3351     n_epochs = 0

ZeroDivisionError: division by zero

But when I re-run the same update code, I get a different error this time (presumably because the first, failed call had already grown the model's internal _raw_data to 20000 rows while embedding_ still has 10000):

ValueError                                Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 model1.update(xvs[10000:20000])

File ~\anaconda3\envs\py39\lib\site-packages\umap\umap_.py:3347, in UMAP.update(self, X)
   3329 self.graph_, self._sigmas, self._rhos = fuzzy_simplicial_set(
   3330     self._raw_data,
   3331     self.n_neighbors,
   (...)
   3341     self.verbose,
   3342 )
   3344 init = np.zeros(
   3345     (self._raw_data.shape[0], self.n_components), dtype=np.float32
   3346 )
-> 3347 init[:original_size] = self.embedding_
   3348 init_update(init, original_size, self._knn_indices)
   3350 if self.n_epochs is None:

ValueError: could not broadcast input array from shape (10000,50) into shape (20000,50)

Not sure how to approach this problem, or whether there is a better solution for batch processing in UMAP; I just need to fit the data in chunks and obtain model.embedding_ at the end for the next steps.
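
One hedged sketch of a guard until update() is fixed, assuming the fitted model can be deep-copied (recent umap-learn versions support pickling, which deepcopy relies on here): attempt each update on a copy, and keep the previous good model if the call raises, since a failed update leaves the object in the inconsistent state seen above.

import copy

import umap

model = umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=50,
                  random_state=42).fit(xvs[:10000])

for start in range(10000, len(xvs), 10000):
    candidate = copy.deepcopy(model)   # work on a copy; update() mutates even on failure
    try:
        candidate.update(xvs[start:start + 10000])
        model = candidate              # keep the successfully updated model
    except ZeroDivisionError:
        print("update failed for batch starting at", start, "- keeping previous model")

embedding = model.embedding_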

Thank you!

preet2312 avatar Dec 16 '22 22:12 preet2312