gensim Fix inconsistent shape of matrix returned by `scipy2scipy

trafficstars

Fixed inconsistent shape of matrix returned by scipy2scipy_clipped, as described in issue #2065 Solved by explicitly setting the second dimension of the new matrix to that of the input matrix.

Fixes #2065.

May 25 '18 20:05 psorianom

hi @psorianom, can you please add tests for this function (at least a case that you described in issue)?

Aug 01 '18 08:08 menshikh-iv

hey @menshikh-iv , thanks for checking this out. Sure, I will do it as soon as possible.

Aug 02 '18 18:08 psorianom

The entire function looks strange. Its documentation doesn't match the code comments:

Return a scipy.sparse vector/matrix consisting of 'topn' elements of the greatest magnitude (absolute value). versus # Sort and clip each row vector first. (so it's actually topn per row?) versus # use np.argpartition/argsort and only form tuples that are actually returned. (??)

@menshikh-iv do we have good unit tests here? The code looks dodgy.

Aug 03 '18 06:08 piskvorky

@piskvorky we have only one test for this function

https://github.com/RaRe-Technologies/gensim/blob/9c6db73919d032ab2f6ea35b3a9043e3b0d2aed5/gensim/test/test_similarities.py#L94-L110

Aug 03 '18 06:08 menshikh-iv

The entire function looks strange. Its documentation doesn't match the code comments:

Return a scipy.sparse vector/matrix consisting of 'topn' elements of the greatest magnitude (absolute value). versus # Sort and clip each row vector first. (so it's actually topn per row?) versus # use np.argpartition/argsort and only form tuples that are actually returned. (??)

Hi all, sorry for the delay.

Regarding the comments, yes, it is per row, so if the input is an array of n rows, the output will have n rows. This differs from full2sparse_clipped which takes row by row (see in SimilarityABC interface: return [matutils.full2sparse_clipped(v, self.num_best) for v in result])

All in all, I realize that the bug I created is not actually a bug per se, i.e., the code does what it should do according to its description (return the 'topn' elements of the greatest magnitude) without regards to the original size of the matrix.

In my use case, I do need the original size of the matrix, as I use it later for other calculations. So maybe adding a flag to full2sparse_clipped and scipy2scipy_clipped such as "keep_size" would do the trick ? What do you think ?

@menshikh-iv do we have good unit tests here? The code looks dodgy.

Oct 15 '18 10:10 psorianom

@psorianom hello, sorry for delay

In my use case, I do need the original size of the matrix, as I use it later for other calculations. So maybe adding a flag to full2sparse_clipped and scipy2scipy_clipped such as "keep_size" would do the trick ?

maybe, sounds compatible :+1:

Jan 17 '19 01:01 menshikh-iv

If so, it needs a clear description + motivation. I still find the docstrings confusing (and contradictory to code), and adding extra parameters won't help.

Jan 17 '19 08:01 piskvorky

gensim
gensim copied to clipboard

Fix inconsistent shape of matrix returned by `scipy2scipy_clipped`

gensim gensim copied to clipboard

Fix inconsistent shape of matrix returned by `scipy2scipy_clipped`

gensim
gensim copied to clipboard