
Problems with sparse

Open pavlin-policar opened this issue 5 years ago • 7 comments

I've been trying to get this to work with sparse matrices.

The setup:

>>> import pynndescent
>>> import scipy.sparse as sp
>>> x = sp.random(1000, 1000, density=0.01)

Next, I try to construct the index; this raises an error:

>>> nn = pynndescent.NNDescent(x)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-63ef2a9d7c43> in <module>
----> 1 pynndescent.NNDescent(x)

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_level, tree_init, random_state, algorithm, max_candidates, n_iters, delta, rho, n_jobs, seed_per_row, verbose)
    565                     )
    566                 metric_nn_descent = sparse.make_sparse_nn_descent(
--> 567                     distance_func, tuple(metric_kwds.values())
    568                 )
    569                 if verbose:

AttributeError: 'NoneType' object has no attribute 'values'

Okay, so metric_kwds must be specified. That's unexpected, but fine, I can work around it.

>>> nn = pynndescent.NNDescent(x, metric_kwds={})  # works!

Great, so I've got the index. Now I want to query it:

>>> nn.query(x)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-11b3373648c0> in <module>
----> 1 nn.query(x)

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, queue_size)
    697         """
    698         # query_data = check_array(query_data, dtype=np.float64, order='C')
--> 699         query_data = np.asarray(query_data).astype(np.float32)
    700         self._init_search_graph()
    701         init = initialise_search(

TypeError: float() argument must be a string or a number, not 'coo_matrix'

Again, it's odd that I can build an index from a sparse matrix but not query with one. OK, I convert the query data to a dense matrix:

>>> nn.query(x.toarray())
---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
<ipython-input-15-c4301d4d66b3> in <module>
----> 1 nn.query(x.toarray())

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, queue_size)
    706             self._random_init,
    707             self._tree_init,
--> 708             self.rng_state,
    709         )
    710         result = self._search(

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in initialise_search(forest, data, query_points, n_neighbors, init_from_random, init_from_tree, rng_state)
     73 ):
     74     results = make_heap(query_points.shape[0], n_neighbors)
---> 75     init_from_random(n_neighbors, data, query_points, results, rng_state)
     76     if forest is not None:
     77         for tree in forest:

~/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
    348                 e.patch_message(msg)
    349 
--> 350             error_rewrite(e, 'typing')
    351         except errors.UnsupportedError as e:
    352             # Something unsupported is present in the user code, add help info

~/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/dispatcher.py in error_rewrite(e, issue_type)
    315                 raise e
    316             else:
--> 317                 reraise(type(e), e, None)
    318 
    319         argtypes = []

~/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/six.py in reraise(tp, value, tb)
    656             value = tp()
    657         if value.__traceback__ is not tb:
--> 658             raise value.with_traceback(tb)
    659         raise value
    660 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Internal error at <numba.typeinfer.ArgConstraint object at 0x7f0cf1997c18>:
--%<----------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/errors.py", line 627, in new_error_context
    yield
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/typeinfer.py", line 201, in __call__
    assert ty.is_precise()
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/typeinfer.py", line 144, in propagate
    constraint(typeinfer)
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/typeinfer.py", line 202, in __call__
    typeinfer.add_type(self.dst, ty, loc=self.loc)
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/errors.py", line 635, in new_error_context
    six.reraise(type(newerr), newerr, tb)
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/six.py", line 659, in reraise
    raise value
numba.errors.InternalError: 
[1] During: typing of argument at /home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py (39)
--%<----------------------------------------------------------------------------


File "../../miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py", line 39:
    def init_from_random(n_neighbors, data, query_points, heap, rng_state):
        for i in range(query_points.shape[0]):
        ^

This error may have been caused by the following argument(s):
- argument 1: cannot determine Numba type of <class 'scipy.sparse.csr.csr_matrix'>

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

I noticed a mention of csr_matrix in that error, so maybe the COO format just isn't supported? Let me try CSR instead:

>>> nn.query(x.tocsr())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-77873307e81a> in <module>
----> 1 nn.query(x.tocsr())

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, queue_size)
    697         """
    698         # query_data = check_array(query_data, dtype=np.float64, order='C')
--> 699         query_data = np.asarray(query_data).astype(np.float32)
    700         self._init_search_graph()
    701         init = initialise_search(

ValueError: setting an array element with a sequence.

Essentially the same failure as with the COO matrix: the query data still goes through np.asarray, which can't handle a sparse matrix.

A little bit of environment info:

Python 3.7.3
---------------------------
numba==0.43.1
scipy==1.2.0
pynndescent==0.3.0

pavlin-policar avatar Jun 02 '19 16:06 pavlin-policar

Hmm, there are definitely still some issues -- the sparse code got brought over from umap, but apparently not everything made it across cleanly. I don't have time to dig into this right now, but hopefully in a week or two I will.

lmcinnes avatar Jun 02 '19 17:06 lmcinnes

Thanks! Let me know if I've left out any useful information you might need.

pavlin-policar avatar Jun 02 '19 17:06 pavlin-policar

I finally got some time and dug into this. The short answer is that not everything got fully ported over from umap, and there were some issues with sparse support in threaded mode (in as much as it wasn't implemented yet). I think I've got that all fixed now. If you could try cloning the current master and let me know whether that resolves the issue for you, I would appreciate it.
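For anyone who wants to test without manually cloning, something like pip install --upgrade git+https://github.com/lmcinnes/pynndescent.git should also pull in the current master (assuming a pip with git support; the repository location here is just the usual GitHub one).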

lmcinnes avatar Jun 11 '19 22:06 lmcinnes

Thanks so much for looking into this! Both of the problems I mentioned seem to be fixed; I can use it with sparse data just as I would expect.
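For reference, the flow that now works for me is roughly the sketch below (I'm assuming CSR input and an (indices, distances) return order from query here):

>>> import pynndescent
>>> import scipy.sparse as sp
>>> x = sp.random(1000, 1000, density=0.01, format="csr")  # sparse input
>>> nn = pynndescent.NNDescent(x)                          # no metric_kwds workaround needed any more
>>> indices, distances = nn.query(x, k=15)                 # querying with sparse data works too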

Unfortunately, building and querying the index now raise several numba warnings, even when I run it on dense data, e.g.

/home/pavlin/dev/pynndescent/pynndescent/sparse_nndescent.py:121: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "pynndescent/utils.py", line 459:
@numba.njit(parallel=True)
def new_build_candidates(
^

  False,
/home/pavlin/dev/pynndescent/pynndescent/sparse_nndescent.py:82: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "pynndescent/utils.py", line 79:
@numba.njit(parallel=True)
def rejection_sample(n_samples, pool_size, rng_state):
^

  indices = rejection_sample(n_neighbors, n_vertices, rng_state)

when building the index and

/home/pavlin/dev/pynndescent/pynndescent/sparse_nndescent.py:273: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "pynndescent/sparse_nndescent.py", line 175:
@numba.njit(parallel=True)
def sparse_init_from_random(
^

  dist_args,
/home/pavlin/dev/pynndescent/pynndescent/sparse_nndescent.py:288: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "pynndescent/sparse_nndescent.py", line 207:
@numba.njit(parallel=True)
def sparse_init_from_tree(
^

  dist_args,
/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba-0.44.0-py3.7-linux-x86_64.egg/numba/ir_utils.py:1958: NumbaPendingDeprecationWarning: 
Encountered the use of a type that is scheduled for deprecation: type 'reflected list' found for argument 'forest' of function 'sparse_initialise_search'.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-reflection-for-list-and-set-types

File "pynndescent/sparse_nndescent.py", line 248:
@numba.njit()
def sparse_initialise_search(
^

  warnings.warn(NumbaPendingDeprecationWarning(msg, loc=loc))

when querying the index.

pavlin-policar avatar Jun 13 '19 12:06 pavlin-policar

It looks like the warnings started appearing with Numba 0.44. I've opened #68 to remove the ones related to parallel=True. There are two more to do with 'reflected list' being deprecated, but those can't easily be fixed until the replacement lands in Numba 0.45. In the meantime the warnings can be ignored, or Numba 0.43 (or earlier) can be pinned explicitly when installing.
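If it helps, a minimal sketch for silencing them in the meantime would be something like the following (I'm assuming the warning classes are importable from numba.errors in 0.44; the exact import path may differ between Numba versions):

>>> import warnings
>>> from numba.errors import NumbaPerformanceWarning, NumbaPendingDeprecationWarning
>>> warnings.simplefilter("ignore", category=NumbaPerformanceWarning)         # parallel=True warnings
>>> warnings.simplefilter("ignore", category=NumbaPendingDeprecationWarning)  # 'reflected list' warnings

Alternatively, pinning the older Numba with something like pip install "numba<0.44" avoids them entirely.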

tomwhite avatar Jun 13 '19 14:06 tomwhite

Thanks for #68 @tomwhite, that should take care of a lot of them. This is popping up in umap as well, so I will have to take care of it there soon.

lmcinnes avatar Jun 13 '19 14:06 lmcinnes

There are two more to do with 'reflected list' being deprecated, but those can't easily be fixed until the replacement lands in Numba 0.45.

@tomwhite I haven't seen those warnings myself, but I think Numba 0.45 is out already. Do you know what the status of those warnings is?

dkobak avatar Sep 09 '19 13:09 dkobak