gensim
gensim copied to clipboard
Fix inconsistent shape of matrix returned by `scipy2scipy_clipped`
Fixed inconsistent shape of matrix returned by scipy2scipy_clipped, as described in issue #2065 Solved by explicitly setting the second dimension of the new matrix to that of the input matrix.
Fixes #2065.
hi @psorianom, can you please add tests for this function (at least a case that you described in issue)?
hey @menshikh-iv , thanks for checking this out. Sure, I will do it as soon as possible.
The entire function looks strange. Its documentation doesn't match the code comments:
Return a scipy.sparse vector/matrix consisting of 'topn' elements of the greatest magnitude (absolute value).
versus
# Sort and clip each row vector first. (so it's actually topn per row?)
versus
# use np.argpartition/argsort and only form tuples that are actually returned. (??)
@menshikh-iv do we have good unit tests here? The code looks dodgy.
@piskvorky we have only one test for this function
https://github.com/RaRe-Technologies/gensim/blob/9c6db73919d032ab2f6ea35b3a9043e3b0d2aed5/gensim/test/test_similarities.py#L94-L110
The entire function looks strange. Its documentation doesn't match the code comments:
Return a scipy.sparse vector/matrix consisting of 'topn' elements of the greatest magnitude (absolute value).versus# Sort and clip each row vector first.(so it's actuallytopnper row?) versus# use np.argpartition/argsort and only form tuples that are actually returned.(??)
Hi all, sorry for the delay.
Regarding the comments, yes, it is per row, so if the input is an array of n rows, the output will have n rows. This differs from full2sparse_clipped which takes row by row (see in SimilarityABC interface: return [matutils.full2sparse_clipped(v, self.num_best) for v in result])
All in all, I realize that the bug I created is not actually a bug per se, i.e., the code does what it should do according to its description (return the 'topn' elements of the greatest magnitude) without regards to the original size of the matrix.
In my use case, I do need the original size of the matrix, as I use it later for other calculations. So maybe adding a flag to full2sparse_clipped and scipy2scipy_clipped such as "keep_size" would do the trick ? What do you think ?
@menshikh-iv do we have good unit tests here? The code looks dodgy.
@psorianom hello, sorry for delay
In my use case, I do need the original size of the matrix, as I use it later for other calculations. So maybe adding a flag to full2sparse_clipped and scipy2scipy_clipped such as "keep_size" would do the trick ?
maybe, sounds compatible :+1:
If so, it needs a clear description + motivation. I still find the docstrings confusing (and contradictory to code), and adding extra parameters won't help.