gensim icon indicating copy to clipboard operation
gensim copied to clipboard

Fix inconsistent shape of matrix returned by `scipy2scipy_clipped`

Open psorianom opened this issue 7 years ago • 7 comments
trafficstars

Fixed inconsistent shape of matrix returned by scipy2scipy_clipped, as described in issue #2065 Solved by explicitly setting the second dimension of the new matrix to that of the input matrix.

Fixes #2065.

psorianom avatar May 25 '18 20:05 psorianom

hi @psorianom, can you please add tests for this function (at least a case that you described in issue)?

menshikh-iv avatar Aug 01 '18 08:08 menshikh-iv

hey @menshikh-iv , thanks for checking this out. Sure, I will do it as soon as possible.

psorianom avatar Aug 02 '18 18:08 psorianom

The entire function looks strange. Its documentation doesn't match the code comments:

Return a scipy.sparse vector/matrix consisting of 'topn' elements of the greatest magnitude (absolute value). versus # Sort and clip each row vector first. (so it's actually topn per row?) versus # use np.argpartition/argsort and only form tuples that are actually returned. (??)

@menshikh-iv do we have good unit tests here? The code looks dodgy.

piskvorky avatar Aug 03 '18 06:08 piskvorky

@piskvorky we have only one test for this function

https://github.com/RaRe-Technologies/gensim/blob/9c6db73919d032ab2f6ea35b3a9043e3b0d2aed5/gensim/test/test_similarities.py#L94-L110

menshikh-iv avatar Aug 03 '18 06:08 menshikh-iv

The entire function looks strange. Its documentation doesn't match the code comments:

Return a scipy.sparse vector/matrix consisting of 'topn' elements of the greatest magnitude (absolute value). versus # Sort and clip each row vector first. (so it's actually topn per row?) versus # use np.argpartition/argsort and only form tuples that are actually returned. (??)

Hi all, sorry for the delay.

Regarding the comments, yes, it is per row, so if the input is an array of n rows, the output will have n rows. This differs from full2sparse_clipped which takes row by row (see in SimilarityABC interface: return [matutils.full2sparse_clipped(v, self.num_best) for v in result])

All in all, I realize that the bug I created is not actually a bug per se, i.e., the code does what it should do according to its description (return the 'topn' elements of the greatest magnitude) without regards to the original size of the matrix.

In my use case, I do need the original size of the matrix, as I use it later for other calculations. So maybe adding a flag to full2sparse_clipped and scipy2scipy_clipped such as "keep_size" would do the trick ? What do you think ?

@menshikh-iv do we have good unit tests here? The code looks dodgy.

psorianom avatar Oct 15 '18 10:10 psorianom

@psorianom hello, sorry for delay

In my use case, I do need the original size of the matrix, as I use it later for other calculations. So maybe adding a flag to full2sparse_clipped and scipy2scipy_clipped such as "keep_size" would do the trick ?

maybe, sounds compatible :+1:

menshikh-iv avatar Jan 17 '19 01:01 menshikh-iv

If so, it needs a clear description + motivation. I still find the docstrings confusing (and contradictory to code), and adding extra parameters won't help.

piskvorky avatar Jan 17 '19 08:01 piskvorky