ValueError: cannot assign slice from input of different size

Open roablep opened this issue 1 year ago • 7 comments

I am using UMAP to reduce dimensions prior to running HDBSCAN. For reasons I do not understand, I periodically get a ValueError. It works with other n_components values, e.g. 2.

I can work on a minimal reproducible example shortly.

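import pandas as pd
import umap
from sklearn.feature_extraction.text import TfidfVectorizer

# Works: projecting to 2 components
umap2 = umap.UMAP(n_components=2)
projection2d = umap2.fit_transform(embedding.toarray())

... other code not relevant ... 

# Periodically raises the ValueError: projecting to umap_val components
uniq = pd.Series(df['stxt'].unique().astype(str)).str.lower().tolist()
embedding = TfidfVectorizer(ngram_range=(1, 1), stop_words='english').fit_transform(uniq)
umapN = umap.UMAP(n_components=umap_val)
reduced_dim_embed = umapN.fit_transform(embedding.toarray())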



embedding.shape
(16888, 4400)

Here's the stack trace:

In [16]: reduced_dim_embed = umapN.fit_transform(embedding.toarray())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 reduced_dim_embed = umapN.fit_transform(embedding.toarray())

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/umap/umap_.py:2772, in UMAP.fit_transform(self, X, y)
   2742 def fit_transform(self, X, y=None):
   2743     """Fit X into an embedded space and return that transformed
   2744     output.
   2745
   (...)
   2770         Local radii of data points in the embedding (log-transformed).
   2771     """
-> 2772     self.fit(X, y)
   2773     if self.transform_mode == "embedding":
   2774         if self.output_dens:

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/umap/umap_.py:2516, in UMAP.fit(self, X, y)
   2510     nn_metric = self._input_distance_func
   2511 if self.knn_dists is None:
   2512     (
   2513         self._knn_indices,
   2514         self._knn_dists,
   2515         self._knn_search_index,
-> 2516     ) = nearest_neighbors(
   2517         X[index],
   2518         self._n_neighbors,
   2519         nn_metric,
   2520         self._metric_kwds,
   2521         self.angular_rp_forest,
   2522         random_state,
   2523         self.low_memory,
   2524         use_pynndescent=True,
   2525         n_jobs=self.n_jobs,
   2526         verbose=self.verbose,
   2527     )
   2528 else:
   2529     self._knn_indices = self.knn_indices

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/umap/umap_.py:328, in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, n_jobs, verbose)
    325     n_trees = min(64, 5 + int(round((X.shape[0]) ** 0.5 / 20.0)))
    326     n_iters = max(5, int(round(np.log2(X.shape[0]))))
--> 328     knn_search_index = NNDescent(
    329         X,
    330         n_neighbors=n_neighbors,
    331         metric=metric,
    332         metric_kwds=metric_kwds,
    333         random_state=random_state,
    334         n_trees=n_trees,
    335         n_iters=n_iters,
    336         max_candidates=60,
    337         low_memory=low_memory,
    338         n_jobs=n_jobs,
    339         verbose=verbose,
    340         compressed=False,
    341     )
    342     knn_indices, knn_dists = knn_search_index.neighbor_graph
    344 if verbose:

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/pynndescent/pynndescent_.py:804, in NNDescent.__init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_prob, n_search_trees, tree_init, init_graph, init_dist, random_state, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, parallel_batch_queries, verbose)
    793         print(ts(), "Building RP forest with", str(n_trees), "trees")
    794     self._rp_forest = make_forest(
    795         data,
    796         n_neighbors,
   (...)
    802         self._angular_trees,
    803     )
--> 804     leaf_array = rptree_leaf_array(self._rp_forest)
    805 else:
    806     self._rp_forest = None

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/pynndescent/rp_trees.py:1097, in rptree_leaf_array(rp_forest)
   1095 def rptree_leaf_array(rp_forest):
   1096     if len(rp_forest) > 0:
-> 1097         return np.vstack(rptree_leaf_array_parallel(rp_forest))
   1098     else:
   1099         return np.array([[-1]])

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/pynndescent/rp_trees.py:1089, in rptree_leaf_array_parallel(rp_forest)
   1088 def rptree_leaf_array_parallel(rp_forest):
-> 1089     result = joblib.Parallel(n_jobs=-1, require="sharedmem")(
   1090         joblib.delayed(get_leaves_from_tree)(rp_tree) for rp_tree in rp_forest
   1091     )
   1092     return result

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/joblib/parallel.py:1098, in Parallel.__call__(self, iterable)
   1095     self._iterating = False
   1097 with self._backend.retrieval_context():
-> 1098     self.retrieve()
   1099 # Make sure that we get a last message telling us we are done
   1100 elapsed_time = time.time() - self._start_time

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/joblib/parallel.py:975, in Parallel.retrieve(self)
    973 try:
    974     if getattr(self._backend, 'supports_timeout', False):
--> 975         self._output.extend(job.get(timeout=self.timeout))
    976     else:
    977         self._output.extend(job.get())

File ~/opt/anaconda3/lib/python3.9/multiprocessing/pool.py:771, in ApplyResult.get(self, timeout)
    769     return self._value
    770 else:
--> 771     raise self._value

File ~/opt/anaconda3/lib/python3.9/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    123 job, i, func, args, kwds = task
    124 try:
--> 125     result = (True, func(*args, **kwds))
    126 except Exception as e:
    127     if wrap_exception and func is not _helper_reraises_exception:

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/joblib/_parallel_backends.py:620, in SafeFunction.__call__(self, *args, **kwargs)
    618 def __call__(self, *args, **kwargs):
    619     try:
--> 620         return self.func(*args, **kwargs)
    621     except KeyboardInterrupt as e:
    622         # We capture the KeyboardInterrupt and reraise it as
    623         # something different, as multiprocessing does not
    624         # interrupt processing for a KeyboardInterrupt
    625         raise WorkerInterrupt() from e

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/joblib/parallel.py:288, in BatchedCalls.__call__(self)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

File ~/github/rivanna-research/.venv/lib/python3.9/site-packages/joblib/parallel.py:288, in <listcomp>(.0)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

ValueError: cannot assign slice from input of different size

roablep avatar Jul 25 '23 19:07 roablep

I have exactly the same issue when using n_components=4. I think it has something to do with the size of the embeddings array and the number of epochs UMAP is trained for. I managed to work around it by decreasing the dimensionality of my embeddings. Decreasing the number of training epochs also helps: I was using a custom value of 500 because it seemed to produce slightly better embeddings for my dataset and computation time wasn't an issue. When I switched back to 200 (the default for large datasets), the bug occurred less often.

Yokto13 avatar Jul 31 '23 14:07 Yokto13

I've been getting the same error for n_components=5.

ckrapu avatar Aug 04 '23 15:08 ckrapu

This is actually an issue with pynndescent, which has since been fixed, but is waiting on a new release for the fix to propagate to PyPI. Hopefully that will be soon (along with another minor release of umap).

lmcinnes avatar Aug 04 '23 17:08 lmcinnes

I experience a similar unhandled exception. Downgrading to pynndescent == 0.5.8 worked for me.

yksnilowyrahcaz avatar Sep 10 '23 00:09 yksnilowyrahcaz

Can anyone confirm if this issue has been fixed with the 0.5.11 release?

KaiWaldrant avatar Nov 30 '23 21:11 KaiWaldrant

Upgrading pynndescent from 0.5.10 to 0.5.11 fixed this ValueError for me.

stevenshave avatar Dec 04 '23 15:12 stevenshave

Can confirm upgrade works too.

sahramohamed avatar Dec 11 '23 16:12 sahramohamed
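
For anyone landing here later, the resolution reported above (upgrading pynndescent to 0.5.11) can be checked with a small script. This is only a sketch: the helper names (parse_release, has_reported_fix, installed_pynndescent) are made up for illustration and are not part of umap or pynndescent. Note that being older than 0.5.11 does not necessarily mean a version is affected, since 0.5.8 was reported as working in this thread; the check only tells you whether you are at or above the release reported as fixed.

```python
import itertools
from importlib.metadata import PackageNotFoundError, version

# Release that commenters in this thread reported as fixing the ValueError.
FIXED_RELEASE = (0, 5, 11)


def parse_release(text):
    """Parse '0.5.10' into (0, 5, 10), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '0.6' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_reported_fix(installed):
    """Return True if `installed` is at or above the release reported as fixed."""
    return parse_release(installed) >= FIXED_RELEASE


def installed_pynndescent():
    """Return the installed pynndescent version string, or None if it is absent."""
    try:
        return version("pynndescent")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pynndescent()
    if found is None:
        print("pynndescent is not installed")
    elif has_reported_fix(found):
        print(f"pynndescent {found}: at or above 0.5.11")
    else:
        print(f"pynndescent {found}: older than 0.5.11; consider upgrading")
```

If the script reports an older version, upgrading with pip (pip install --upgrade pynndescent) is the fix that several commenters confirmed.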