torchsearchsorted

Implement the `side` keyword argument as in numpy

baldassarreFe opened this issue on Oct 16 '19 · 12 comments

The NumPy version of searchsorted takes an additional parameter to specify the semantics of the returned index:

side    returned index i satisfies
left    a[i-1] < v <= a[i]
right   a[i-1] <= v < a[i]
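
For example, with a small array in numpy:

>>> import numpy as np
>>> a = np.array([1, 2, 2, 2, 3])
>>> np.searchsorted(a, 2, side='left')   # insert before the existing 2s
1
>>> np.searchsorted(a, 2, side='right')  # insert after the existing 2s
4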

The current PyTorch implementation of searchsorted follows the left behavior. Would it be possible to also implement the right version, so values can be ordered from the right?

baldassarreFe avatar Oct 16 '19 16:10 baldassarreFe

hi, could you check out the latest version of master ?

aliutkus avatar Oct 22 '19 13:10 aliutkus

Hi, just checked.

Here are a few comments about the code:

In the pytest test here, why do we have separate Ba and Bv, and then skip the test if Ba != Bv? Couldn't the test iterate over a single value B for the batch dimension? For example:

from itertools import product

import pytest
import torch

B_val = [1, 100, 200]
A_val = [1, 50, 500]
V_val = [1, 12, 120]
side_val = ['left', 'right']
nrepeat = 100

@pytest.mark.parametrize('B,A,V,side', product(B_val, A_val, V_val, side_val))
def test_searchsorted_correct(B, A, V, side, device):
    for test in range(nrepeat):
        a = torch.sort(torch.rand(B, A, device=device), dim=1)[0]
        v = torch.rand(B, V, device=device)
        ...

I see the function numpy_searchsorted was moved to the main source folder, but it was meant as a test-only utility. That's why it lived in test/utils.py and not in src: it was available to pytest at test time, but not anywhere else. Since this repo is a PyTorch implementation of searchsorted, I don't think the numpy version should be part of the main code; in particular, it shouldn't be exposed in src/torchsearchsorted/__init__.py, as that could confuse users.

For checking the results in examples/test.py we should use torch.testing.assert_allclose instead of the norm, also because, as you noticed, torch.norm expects float tensors rather than long indices. Also, is this test still necessary now that we have unit tests in place? Maybe the examples could simply showcase how to use the searchsorted function.
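
For instance, something along these lines (just a sketch: I'm assuming the exported function is searchsorted(a, v) as in the current examples; the exact signature may change once side is in):

import numpy as np
import torch
from torchsearchsorted import searchsorted  # assumed import, as in the examples

a = torch.sort(torch.rand(8, 50), dim=1)[0]
v = torch.rand(8, 12)

out = searchsorted(a, v)
# row-by-row numpy reference
ref = np.stack([np.searchsorted(a_row, v_row)
                for a_row, v_row in zip(a.numpy(), v.numpy())])
# the outputs are small integer indices, so comparing as floats is exact
torch.testing.assert_allclose(out.cpu().float(), torch.from_numpy(ref).float())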


Regarding the implementation of side, did you manage to get pytest to succeed? On my machine, it runs some of the tests successfully, then on one test it hangs forever while appearing busy.

If you run pytest -v you'll see it hangs on different test cases: sometimes it's test/test_searchsorted.py::test_searchsorted_correct[cpu-100-100-50-120-right], other times it's test/test_searchsorted.py::test_searchsorted_correct[cpu-200-200-500-120-right].

My guess is that some bug in the code for side=right results in the binary search loop getting stuck forever. I'll have to look into that more carefully though.
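
For reference, here is the termination property I'd expect, as a pure-Python sketch (not the actual C++/CUDA kernel): the search interval [lo, hi) must strictly shrink at every iteration, and a side='right' branch that fails to advance on equality would spin forever.

def insertion_index(a, v, side='left'):
    # a is a sorted 1-D sequence; return the insertion index of v
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        # 'left' moves right only on a strict '<';
        # 'right' also moves right on equality
        if a[mid] < v or (side == 'right' and a[mid] == v):
            lo = mid + 1  # lo strictly increases, so the loop terminates
        else:
            hi = mid      # mid < hi, so hi strictly decreases
    return lo

assert insertion_index([1, 2, 3, 3, 3, 6], 3, 'left') == 2
assert insertion_index([1, 2, 3, 3, 3, 6], 3, 'right') == 5
assert insertion_index([1, 2, 3], 3, 'right') == 3  # value equal to the last element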

baldassarreFe avatar Oct 23 '19 12:10 baldassarreFe

Hello, I am having similar trouble with the side parameter. I am trying to put together minimal code to reproduce the bug, but it seems to happen randomly.

ncourty avatar Oct 23 '19 13:10 ncourty

Thanks for your feedback

concerning this:

why do we have separate Ba, Bv

This is because either a or v can have 1 as its first dimension while the other one is larger, which was not tested before. However, they should not have different first dimensions otherwise. These cases are missing from your proposed test, and were missing from numpy_searchsorted too (hence the sel subfunction that is there now).

However, I realize that the test code was indeed wrong; it should be:

if Ba > 1 and Bv > 1 and Ba != Bv:
    return

This is fixed now.

Concerning the placement of numpy_searchsorted in the main package: I thought it might be useful to several people for testing. Since torch depends on numpy anyway (doesn't it?), I figured it would not be harmful.

Now, concerning the tests:

  • when it fails, as I now mention in the README.md:

    In some cases, the results differ from numpy's version. However, as far as I could see, this only happens when values are equal, which means we actually don't care about the order in which the value is inserted. I decided not to worry about this either.

    I investigated this, and it happens when the value to insert is equal to an element of the array it is inserted into. The result is still valid, just different from numpy's. Do we really care in that case?

  • when it hangs forever: it's true, I observe it too; I will investigate.

what do you think ?

aliutkus avatar Oct 23 '19 15:10 aliutkus

ok, I found the cause of the endless loop: it was happening when the last element of a was equal to the value to insert. This should be fixed now. I am now investigating whether it's worth/easy to fix the discrepancies with numpy.
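
For the record, in numpy terms the problematic case is the one where side='right' has to return an index one past the end of the array:

>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> np.searchsorted(a, 3, side='left')
2
>>> np.searchsorted(a, 3, side='right')
3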

aliutkus avatar Oct 23 '19 15:10 aliutkus

Great ! Your last update solved the bug on my side. Thanks for the nice code 👍

ncourty avatar Oct 23 '19 16:10 ncourty

no pb, sorry for this, I shouldn't have merged with master so early

aliutkus avatar Oct 23 '19 16:10 aliutkus

Well, I argue that we do care about what happens in case the value to insert is already present in the array. The whole point of the side parameter is to decide where to insert that value in case of this ambiguity. Let me first show you some examples and then I'll explain my use case:

If the value is not present in the array, both sides give the same answer:

>>> a = np.array([1, 2, 3, 5, 6, 7])
>>> v = 4
>>> np.searchsorted(a, v, side='left')
3
>>> np.searchsorted(a, v, side='right')
3

With side=left the returned index will place the new value to the left of identical values present in the array:

>>> a = np.array([1, 2, 3, 3, 3, 3, 3, 3, 6, 7])
>>> v = 3
>>> np.searchsorted(a, v, side='left')
2

On the contrary, side=right will give an insertion index that places the new value on the right of identical values present in the array:

>>> a = np.array([1, 2, 3, 3, 3, 3, 3, 3, 6, 7])
>>> v = 3
>>> np.searchsorted(a, v, side='right')
8

Now, for my use case: I use searchsorted to bucketize a series of real-valued distances into bins of unequal size. E.g. the distance .3 must end up in the first bin [0, 1), the distance 3.4 in the second bin [1, 5), and so on. The bin indices are then one-hot encoded and become input features for a model.

The bounds of these buckets are represented like this, with the last bucket being [100, +inf):

buckets = [0, 1, 5, 10, 20, 100]

Therefore, for these input distances:

dist = [0, .3, 4, 4.5, 5, 6.8, 23.4, 123, 401]

I expect this output:

idxs = [0, 0, 1, 1, 2, 2, 4, 5, 5]

I can get this by doing:

>>> np.searchsorted(buckets, dist, side='right') - 1
[0, 0, 1, 1, 2, 2, 4, 5, 5]

In contrast, side='left' would give the wrong index for items sitting exactly on a bucket boundary:

>>> np.searchsorted(buckets, dist, side='left')
[0, 1, 2, 2, 2, 3, 5, 6, 6]

baldassarreFe avatar Oct 26 '19 19:10 baldassarreFe

Ah, about the broadcasting behavior for the batch dimension: I understand now what that test and the sel function were about. I think it could be written more clearly using numpy's broadcast_arrays and pytorch's broadcast_tensors; those utility functions take care of the broadcasting for us and are already tested in their respective projects.
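
Roughly what I have in mind for the batch dimension (just a sketch; broadcast_batch is a hypothetical helper name, and the expand() calls give the same non-copying views that broadcast_tensors would):

import torch

def broadcast_batch(a, v):
    # hypothetical helper: a is (Ba, A) sorted along dim 1, v is (Bv, V);
    # only the batch dimension is broadcast, so Ba and Bv must match or one of them be 1
    B = max(a.shape[0], v.shape[0])
    # expand() returns non-copying views and raises if a dimension other than 1
    # does not match the requested size, which is the validation we want here
    return a.expand(B, a.shape[1]), v.expand(B, v.shape[1])

a = torch.sort(torch.rand(1, 50), dim=1)[0]  # batch of 1, broadcast to 100
v = torch.rand(100, 12)
a, v = broadcast_batch(a, v)
assert a.shape == (100, 50) and v.shape == (100, 12)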

I can update the python code where needed; it should be quite quick. Then some time next week I'll have a look at the C++ and CUDA parts to address the side logic ;)

baldassarreFe avatar Oct 26 '19 19:10 baldassarreFe

ha! thanks for the explanations! ok, feel free to update the code and make a new PR when ready

aliutkus avatar Oct 26 '19 20:10 aliutkus

I have a working version of this in my fork. I rewrote most of the search code, which should now be more readable and possibly faster. Do you want to check it out?

I'll send a PR very soon, just want to write a benchmark script ;)

baldassarreFe avatar Nov 18 '19 15:11 baldassarreFe

Hi @baldassarreFe

ok, cool, thanks for this ! Please make sure that you start from a fresh version before doing a PR ! best

antoine

aliutkus avatar Nov 19 '19 06:11 aliutkus