
ValueError: cannot assign slice from input of different size

Jiawei-Xing opened this issue 2 years ago · 8 comments

Hi, I want to use UMAP on a large distance matrix (369911 x 369911). I followed the first example of "UMAP on sparse data" from the tutorial (I've tried both LIL and CSR sparse matrices). The code worked well on a smaller sample dataset but failed on my large matrix. My sparse matrix is ~9 GB, and I was running it on an HPC node with 10 CPUs (~30 GB of memory). The low_memory option was set to True.
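
Roughly, the call looks like this (sketched with a small random placeholder matrix and default settings, not my real data or full script):

import scipy.sparse
import umap

# Placeholder standing in for the real 369911 x 369911 sparse distance matrix.
matrix = scipy.sparse.random(10000, 10000, density=1e-3, format="csr", random_state=42)

reducer = umap.UMAP(low_memory=True)
mapper = reducer.fit(matrix)  # fails with the traceback below on the full-size matrix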

Traceback (most recent call last):
  File "umap.py", line 51, in <module>
    mapper = reducer.fit(matrix)
  File "/home/xing.232/.local/lib/python3.7/site-packages/umap/umap_.py", line 2526, in fit
    verbose=self.verbose,
  File "/home/xing.232/.local/lib/python3.7/site-packages/umap/umap_.py", line 340, in nearest_neighbors
    compressed=False,
  File "/home/xing.232/.local/lib/python3.7/site-packages/pynndescent/pynndescent_.py", line 804, in __init__
    leaf_array = rptree_leaf_array(self._rp_forest)
  File "/home/xing.232/.local/lib/python3.7/site-packages/pynndescent/rp_trees.py", line 1097, in rptree_leaf_array
    return np.vstack(rptree_leaf_array_parallel(rp_forest))
  File "/home/xing.232/.local/lib/python3.7/site-packages/pynndescent/rp_trees.py", line 1090, in rptree_leaf_array_parallel
    joblib.delayed(get_leaves_from_tree)(rp_tree) for rp_tree in rp_forest
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/parallel.py", line 1098, in __call__
    self.retrieve()
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/local/anaconda3-2020.02/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/usr/local/anaconda3-2020.02/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 620, in __call__
    return self.func(*args, **kwargs)
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/parallel.py", line 289, in __call__
    for func, args, kwargs in self.items]
  File "/home/xing.232/.local/lib/python3.7/site-packages/joblib/parallel.py", line 289, in <listcomp>
    for func, args, kwargs in self.items]
ValueError: cannot assign slice from input of different size

Jiawei-Xing avatar May 13 '23 04:05 Jiawei-Xing

I believe this is an issue related to compilation caching in pynndescent. If you reinstall pynndescent, preferably directly from GitHub, it should resolve the issue.
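
For example, something along these lines should pull the current GitHub version (adjust for your own environment):

pip install --force-reinstall --no-cache-dir git+https://github.com/lmcinnes/pynndescent.git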

To be clear, however, if pynndescent is running then it is computing nearest neighbors of vectors, so it is treating each row of your distance matrix as a large sparse feature vector rather than as precomputed distances. That probably isn't what you want, so I would check that this is actually what you intend.
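
For a genuinely precomputed distance matrix the usual pattern is metric="precomputed", along these lines (a small dense sketch just to show the call; a dense matrix at your scale obviously won't fit in memory):

import numpy as np
import umap

# Small dense stand-in for a pairwise distance matrix.
rng = np.random.default_rng(42)
dist = rng.random((500, 500))
dist = (dist + dist.T) / 2    # distances must be symmetric
np.fill_diagonal(dist, 0.0)   # and zero on the diagonal

reducer = umap.UMAP(metric="precomputed")
embedding = reducer.fit_transform(dist)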

lmcinnes avatar May 13 '23 12:05 lmcinnes

Thank you for your quick response! You are right, I should use metric="precomputed" to fit the distance matrix.

A related question about the matrix input: my original similarity matrix is sparse, but when I convert it to distances (1 - similarity), most elements become 1. This causes memory issues because the matrix is no longer sparse. Is it possible to fit the model with a similarity matrix, or is there another way to overcome this?
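
Something like the following is what I have in mind (sketched with a random placeholder matrix), assuming metric="precomputed" accepts a sparse input at all; I'm not sure how the unstored entries would be interpreted:

from scipy.sparse import random as sparse_random
import umap

# Placeholder standing in for the real sparse similarity matrix (values in (0, 1)).
sim = sparse_random(3000, 3000, density=0.02, format="csr", random_state=0)
sim = sim.maximum(sim.T).tocsr()  # make it symmetric

dist = sim.copy()
dist.data = 1.0 - dist.data  # convert only the stored similarities to distances
dist.setdiag(0)              # self-distances must be zero
# pairs that were never compared stay unstored instead of becoming an explicit distance of 1

reducer = umap.UMAP(metric="precomputed")
embedding = reducer.fit_transform(dist)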

Jiawei-Xing avatar May 14 '23 22:05 Jiawei-Xing

Hello. I also have the same error. It is strange that UMAP works well on some of my datasets but returns this error on other datasets with the same format. I also tried reinstalling pynndescent directly from GitHub, but the same error still occurs. Could anyone help? (Screenshot of the error attached.)

ReaganGen avatar May 21 '23 20:05 ReaganGen

Hello -

I have a similar issue to the one above. I also reinstalled pynndescent directly from the GitHub master. Note: the script works for smaller files; right now I am running a relatively simple workflow, which breaks on larger files (~44,000 x 4,000):

fit = umap.UMAP(n_neighbors=20, n_components=2, min_dist=0.1)
umap_spectrogram = fit.fit_transform(spectrograms_)

ValueError                                Traceback (most recent call last)
Cell In[6], line 7
      4 # these settings seem to work pretty good: n_neighbors=30, n_components=3, min_dist=0.5
      5 fit = umap.UMAP(n_neighbors=20, n_components=2, min_dist=0.1)
----> 7 umap_spectrogram = fit.fit_transform(spectrograms_)
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\umap\umap_.py", line 2772, in UMAP.fit_transform
    self.fit(X, y)
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\umap\umap_.py", line 2516, in UMAP.fit
    ) = nearest_neighbors(
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\umap\umap_.py", line 328, in nearest_neighbors
    knn_search_index = NNDescent(
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\pynndescent\pynndescent_.py", line 804, in NNDescent.__init__
    leaf_array = rptree_leaf_array(self._rp_forest)
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\pynndescent\rp_trees.py", line 1097, in rptree_leaf_array
    return np.vstack(rptree_leaf_array_parallel(rp_forest))
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\pynndescent\rp_trees.py", line 1089, in rptree_leaf_array_parallel
    result = joblib.Parallel(n_jobs=-1, require="sharedmem")(
        joblib.delayed(get_leaves_from_tree)(rp_tree) for rp_tree in rp_forest
    )
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\joblib\parallel.py", line 1098, in Parallel.__call__
    self.retrieve()
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\joblib\parallel.py", line 975, in Parallel.retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "c:\ProgramData\Anaconda\envs\seis\Lib\multiprocessing\pool.py", line 774, in ApplyResult.get
    raise self._value
  File "c:\ProgramData\Anaconda\envs\seis\Lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\joblib\_parallel_backends.py", line 620, in SafeFunction.__call__
    return self.func(*args, **kwargs)
  File "c:\ProgramData\Anaconda\envs\seis\Lib\site-packages\joblib\parallel.py", line 288, in BatchedCalls.__call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
ValueError: cannot assign slice from input of different size

ntmaier avatar May 23 '23 21:05 ntmaier

I also have the same ValueError problem as mentioned above.

liufeifan avatar Jun 20 '23 02:06 liufeifan

Hi

This problem is happening very frequently. I have a dataset on which UMAP works well. However, when I tried to build the UMAP embedding within 10-fold cross-validation, the error appeared in some folds.

Please advise.

ogreyesp avatar Jun 25 '23 13:06 ogreyesp

Hi @ogreyesp, as @lmcinnes said, it seems to be a pynndescent issue. Using pynndescent version 0.5.8 works perfectly for me.
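
For reference, pinning that version is just:

pip install pynndescent==0.5.8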

carluqcor avatar Jun 26 '23 14:06 carluqcor

I also have the same ValueError problem as mentioned above.

I solved the problem by going back to an older version, 0.5.0.

liufeifan avatar Jul 20 '23 03:07 liufeifan