torchsearchsorted
torchsearchsorted copied to clipboard
Implement the `side` keyword argument as in numpy
The Numpy version of searchsorted takes an additional parameter to specify the semantic of the returned index:
side | returned index i satisfies |
---|---|
left | a[i-1] < v <= a[i] |
right | a[i-1] <= v < a[i] |
The current implementation of searchsorted
for Pytorch follows the left
behavior, is it possible to implement the right
version to allow ordering from the right?
hi, could you check out the latest version of master
?
Hi, just checked.
Here are a few comments about the code:
In the pytest test here, why do we have separate Ba, Bv
and then skip the test if Ba != Bv
? Couldn't the test iterate over a single value B
for the batch dimension? Like:
B =[1, 100, 200]
A_val = [1, 50, 500]
V_val = [1, 12, 120]
side_val = ['left', 'right']
nrepeat = 100
@pytest.mark.parametrize('B,A,V,side', product(Ba_val, Bv_val, A_val, V_val, side_val))
def test_searchsorted_correct(B, A, V, side, device):
for test in range(nrepeat):
a = torch.sort(torch.rand(B, A, device=device), dim=1)[0]
v = torch.rand(B, V, device=device)
....
I see the function numpy_searchsorted
was moved to the main source folder, but it was meant as a test-only utility. That's why it was in test/utils.py
and not in src
: it was available to pytest at test time, but not available anywhere else. Since this repo is a pytorch implementation of searchsorted, I don't think the numpy version should be part of the main code, in particular it shouldn't be exposed in src/torchsearchsorted/__init__.py
as it could confuse the user.
For testing the results in examples/test.py
we should use torch.testing.assert_allclose
instead of using the norm, also because as you noticed the torch.norm
expects float tensors instead of long indexes. Also, is this test necessary now that we have unit tests in place? Maybe the examples could simply showcase how you use the searchsorted
function.
Regarding the implementation of side
, did you manage to get pytest to succeed? On my machine, it runs some of the tests successfully, then on one test it hangs forever while appearing busy.
If you run pytest -v
you'll see it hangs on different test cases, sometimes it's
test/test_searchsorted.py::test_searchsorted_correct[cpu-100-100-50-120-right]
other times it's
test/test_searchsorted.py::test_searchsorted_correct[cpu-200-200-500-120-right]
My guess is that some bug in the code for side=right
results in the binary search loop getting stuck forever. I'll have to look into that more carefully though.
Hello I am having similar troubles with the side parameter. I am trying to get a minimal code for reproducing the bug but it seems the trouble is happening randomly.
Thanks for your feedback
concerning this:
why do we have separate
Ba
,Bv
This is because actually one from a
or b
can have 1
as a first dimension while the other is larger, which was not tested. However, they should not have different first dimensions if different from 1
. You forgot these in your tests, as well as in your numpy_searchsorted
actually, (hence the sel
subfunction there now).
I however realize that the test code is indeed wrong, because it should be:
if Ba>1 and Bv>1 and Ba != Bv:
return
This is fixed now.
Concerning the placement of this numpy_searchsorted
in the main package, I thought it may be useful to several people for testing. Since there is anyways a dependency of torch on numpy (isnt't there?) I thought it would not be harmful,
Now, concerning the tests:
-
when it fails, as I now mention in the README.md:
In some cases, the results vary from numpy's version. However, as far as I could see, this only happens when values are equal, which means we actually don't care about the order in which this value is added. I decided not to care about this also.
I investigated this, and this happens when the value to insert and the array in which to insert are equal. It actually means that the result is right, just different from numpy. Do we really care in that case ?
-
when it stops forever: it's true, I also observe it actually, will investigate
what do you think ?
ok, I found out the error in the endless loop: this was happening when the last element of a
was equal to the value to insert. This should be fixed. Now investigating whether it's worth/easy to fix the discrepancies with numpy
Great ! Your last update solved the bug on my side. Thanks for the nice code 👍
no pb, sorry for this, I shouldn't have merged with master so early
Well, I argue that we do care about what happens in case the value to insert is already present in the array. The whole point of the side
parameter is to decide where to insert that value in case of this ambiguity. Let me first show you some examples and then I'll explain my use case:
If the value is not preset in the array, so both sides give the same answer:
>>> a = np.array([1, 2, 3, 5, 6, 7])
>>> v = 4
>>> np.searchsorted(a, v, side='left')
3
>>> np.searchsorted(a, v, side='right')
3
With side=left
the returned index will place the new value to the left of identical values present in the array:
>>> a = np.array([1, 2, 3, 3, 3, 3, 3, 3, 6, 7])
>>> v = 3
>>> np.searchsorted(a, v, side='left')
2
On the contrary, side=right
will give an insertion index that places the new value on the right of identical values present in the array:
>>> a = np.array([1, 2, 3, 3, 3, 3, 3, 3, 6, 7])
>>> v = 3
>>> np.searchsorted(a, v, side='left')
8
Now, for my use case. I use searchsorted to bucketize a series of real-valued distances into bins of unequal size. E.g. the distance .3
must end up in the first bin [0, 1)
and the distance 3.4
must end up in the second bin [1, 5)
etc. The bin indexes then are one-hot encoded and become input features for a model.
The bounds of these buckets are represented like this, with the last bucket being [100, +inf)]
:
buckets = [0, 1, 5, 10, 20, 100]
Therefore, for these input distances:
dist = [0, .3, 4, 4.5, 5, 6.8, 23.4, 123, 401]
I expect this output:
idxs = [0, 0, 1, 1, 2, 2, 4, 5, 5]
Which I can get by doing
>>> np.searchsorted(buckets, dist, side='right') - 1
[0, 0, 1, 1, 2, 2, 4, 5, 5]
While doing this would give the wrong index for items on the bucket boundaries:
>>> np.searchsorted(buckets, dist, side='left')
[0, 1, 2, 2, 2, 3, 5, 6, 6]
Ah about the broadcasting behavior for the batch dimension. I understand now what that test and the sel
function were about. I think it could be written more clearly by using numpy's broadcast_arrays and pytorch's broadcast_tensors. Those utility functions take care of the broadcasting for us and are tested in the respective projects.
I can update the python code where needed, should be quite quick. Then some time next week I'll have a look at the C++ and CUDA parts to address the side
logic ;)
ha! thanks for the explanations! ok, feel free to update the code and make a new PR when ready
I have a working version of this in my fork. I rewrote most of the search code, which should now be more readable and possibly faster. Do you want to check it out?
I'll send a PR very soon, just want to write a benchmark script ;)
Hi @baldassarreFe
ok, cool, thanks for this ! Please make sure that you start from a fresh version before doing a PR ! best
antoine