gensim icon indicating copy to clipboard operation
gensim copied to clipboard

Updated RandomState (deprecated from numpy) to default_rng (Generator)

Open SagarDollin opened this issue 4 years ago • 5 comments
trafficstars

This is regarding issue #2782 @piskvorky Here are the benchmarks of before and after updating:

Files updated test file Before Update After Update
Poincare.py test_poincare.py Ran 42 tests in 0.418s Ran 42 tests in 0.417s
test_ldamodel.py, ldamodel.py test_ldamodel.py Ran 48 tests in 223.845s Ran 48 tests in 225.561s
utils.py test_utils.py Ran 24 tests in 0.007s Ran 24 tests in 0.007s
test_matutils.py test_matutils.py Ran 18 tests in 0.071s Ran 18 tests in 0.070s
word2vec.py test_word2vec.py Ran 79 tests in 58.149s Ran 79 tests in 57.950s

I don't find a big difference in the time taken to run after the update. However, I feel it is good to be updated along with Numpy.

Why did you create this PR? To update RandomState occurrences to default_rng as RandomState is deprecated from NumPy.

What functionality did you set out to improve? I have updated the code such that it now does not need to rely on RandomState, but also the code is backward compatible. If we load a pre-trained older version model in this repo, it will be able to run smoothly as default_rng supports all the methods present in RandomState except for randint for randint we have replaced it with integers, but for backward compatibility, I have done something like this:

if isinstance(random_state , np.random.RandomState)
    random_state.randint(..)
else:
    random_sate.integers(..)

The above makes sure that if random_sate is Generator object, we use integers; otherwise, if it's a RandomState object, we use randint for backward compatibility.

What was the problem + an overview of how you fixed it? In the issue, it was claimed that RandomState made the code slower, but I do not find much difference(This could be because we are running it on relatively small data). However, it is good practice to use the updated versions and replace the deprecated ones.

Whom does it affect, and how should people use it? It affects everyone who uses gensim framework, SDE, Researchers, etc.

SagarDollin avatar Aug 28 '21 21:08 SagarDollin

Things achieved in this PR:

  1. I have resolved all the dependencies on RandomState.
  2. The code is still backward compatible in the sense, we can load a pre-trained model that relies on RandomState instead of the Generator and still be able to run it.

SagarDollin avatar Aug 31 '21 06:08 SagarDollin

The build-wheels checks are failing, I think this is because the file build-wheels. A change was made 13 days ago where they added username. The error we are getting is also related to username: image

Should we create an issue for this? I observed that in the previous pull requests this issue wasn't seen. In fact, the newer PR seems to be having more checks than previous ones. A similar case can be seen in PR #3222

SagarDollin avatar Aug 31 '21 11:08 SagarDollin

Thanks for investigating! TBH I'm not a fan of all the if-then logic. It will be really hard to maintain.

When does numpy actually drop the deprecated RandomState? We should make a hard cliff and switch over, without the ifs.

piskvorky avatar Feb 19 '22 15:02 piskvorky

Hey @piskvorky Thanks for your feedback. I understand your concerns. I'll give some thought to the feedback and start working on it soon.

SagarDollin avatar Feb 20 '22 09:02 SagarDollin

@SagarDollin Are you still interested in working on this?

mpenkov avatar Apr 08 '24 03:04 mpenkov