gensim random.RandomState with different versions of numpy has vastly different performance

the performance of random.RandomState in word2vec.py (version 3.8.0)

def seeded_vector(self, seed_string, vector_size):
         once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
         return (once.rand(vector_size) - 0.5) / vector_size

seemingly depends greatly on the version of numpy installed. With numpy = 1.14.3, the following code

from  numpy.random import RandomState as Ran
from time import time
t1 = time()
for i in range(100000):
    temp = Ran(hash((i)) & 0xffffffff)
t2 = time()
t2-t1

produced 0.28105926513671875 exactly the same code with numpy= 1.18.1 produced

18.590345859527588

I noticed this because I was training a model with millions of words as vocabulary, and after updating numpy unwittingly (via a anaconda update), I noticed that the time for build_vocab was significantly longer, and after some debugging, I nailed it down to random.RandomState in the seeded_vector function. I know this is indeed a numpy issue, but even they mentioned it that RandomState is legacy (https://docs.scipy.org/doc/numpy/reference/random/performance.html). Therefore I wonder if you have some plans to upgrade randomstate? Thanks!

Apr 03 '20 17:04 alexcarterkarsus

Thanks for reporting. What was the difference in random.RandomState that you nailed it down to?

We should replace RandomState if it's deprecated. What should it be replaced with, is the change 100% backward compatible? Can you open a PR?

Apr 03 '20 17:04 piskvorky

what I found is that numpy.random.RandomState defaults to the legacy random generator, i.e., the slowest, however, there are other options: PCG64, SFC64, etc. You could simply use them (after import numpy.random as random) with, e.g., random.SFC64, instead of random.RandomState.Same code as above but using SFC64, even with numpy= 1.18.1, would give 2.767242193222046 which is the fastest of the bunch, although still not as fast as in numpy=1.14.3 I don't know exactly what caused the performance change inside numpy, and I also don't know how you can easily amend the code without breaking compatibility.

Apr 03 '20 19:04 alexcarterkarsus

I don't think it's necessary (or reasonable for users to expect) that the same seeding initializes vectors identically across changed gensim versions. (The starting state shouldn't be particularly important for end results under realistic uses – such as multithreaded training, which introduces lots more uncontrollable randomness.)

So we should just choose whatever numpy recommends as a fast pseudorandom generator.

Apr 05 '20 04:04 gojomo

Agreed. A 60x slowdown is really worrying.

Apr 05 '20 09:04 piskvorky

Ultimately any other uses of RandomState in gensim – https://github.com/RaRe-Technologies/gensim/search?q=RandomState&type=Code – will face the same slowdown unless updated to match numpy best-practices.

Jun 29 '20 20:06 gojomo

@zygm0nt do you think you could take care of all the places gensim uses the old (slow) numpy RNG?

Jun 29 '20 21:06 piskvorky

Sure, I can take care of that. Should this be done in this ticket? by reopening it?

Jun 29 '20 21:06 zygm0nt

Yeah, let me reopen. Thanks!

Jun 29 '20 22:06 piskvorky

Hi, @zygm0nt @piskvorky , Any updates on this issue? I would like to contribute to a PR if it is already created. Also, please help me with the best practices I must follow during a PR. Are there any unit tests to pass before creating a PR. Or do I need to check the integrity of the code locally (in which case point me to any documentation that could help)

Thanks, looking forward to your reply.

Aug 26 '21 12:08 SagarDollin

Sure! The code style and instructions for Gensim are here: https://github.com/RaRe-Technologies/gensim/wiki/Developer-page Thanks.

Aug 26 '21 12:08 piskvorky

Hi @piskvorky @gojomo, I'm here with some updates about the work I did today. From the above comments, it is clear that we need to replace all instances of RandomState and RandomState objects as it is deprecated from new versions of NumPy and also because it is 60 times slower.

What can we use to replace RandomState? Ans: Instead of using np.random.RandomState(seed) we can use np.random.default_rng(seed) . default_rng instantiates a Generator with numpy's default BitGenerator(PCG64). For more info refer this documentation.

Will making the above change introduce new bugs? Ans: The above change could potentially introduce new bugs if we do not take care of lines of code dependent on RandomState because we are using a Generator which is a replacement for RandomState , but they are not the exact same. To avoid any bugs, I have to do modifications wherever we have a dependency on RandomState.

To deal with any dependencies, I did some digging into the code, and I found few instances that have a dependency on RandomState, which will not cause any new bug given we do few modifications. Refer below for all the details(you can skip this part):

In the file gensim/utils the function get_random_state(seed) has the line return np.random.mtrand._rand on line 90, but the returned object np.random.mtrand._rand is an instance of RandomState(refer to the screenshot below) The simple solution to this dependency is to replace np.random.mtrand._rand with an instance of Generator i.e., np.random.deafult_rng().
There are other dependencies like in the file gensim/models/poincare.py there are instances of like these:
1. self._np_random.uniform(..)
2. self._np_random.randint(..)
3. self._np_random.choice(..)
4. self._np_random.shuffle(..) where self._np_random = np_random.RandomState(seed). Now since we are replacing RandomState , the self._np_random will be a Generator, and we have the implementation of uniform(..), choice(..) , shuffle(..) with the same parameters, except randint in the Generator. I need to replace all occurrences of randint(..) with integers(..) that basically take the same parameters as randint in our case and do the same job.

You can see more info about these methods here in the documentation.

Note: Here, in any line of code I have mentioned with np, I'm referring to the numpy package.

To Conclude Let me know if anyone has any other ideas or concerns with the above method soon, as I will start working on the PR immediately.

Aug 26 '21 19:08 SagarDollin

Sounds good to me, thanks.

After the replacement, can you do a sanity check re. performance? The new code should be faster (or at least not slower) than the existing code. I don't expect the overall impact will be too large (RNG is just a small part of the ML algos), but it would still be nice to include some concrete numbers in the release notes.

Aug 26 '21 20:08 piskvorky

Hey @piskvorky, Thanks for your comment. Sure I will include a sanity test. I needed some clarifications for the sanity testing.

The Poincare model in the gensim/models/poincare.py, which seems to be heavily relying on RandomState methods. So I will be testing this model after the change. However I might need a big dataset in order to test this. The dataset poincare_hypernyms_large might be a little small for that. Let me know if you can point me to any other bigger dataset, or should I go forward with the current dataset?
Apart from Poincare there is word2vec that has RandomState but it is not using any of the methods of the RandomState. So I'm guessing testing this would give us the same results after making changes.
Should I include the sanity test file in the PR? Or is a screenshot of the results enough?

Thanks, Sagar B Dollin

Aug 27 '21 18:08 SagarDollin

Should I include the sanity test file in the PR? Or is a screenshot of the results enough?

A summary of your benchmark as part of the PR description is enough. Nothing fancy or formal – really, a rudimentary sanity check. Thanks.

Aug 27 '21 18:08 piskvorky

Hi @piskvorky, The PR is ready for review. Things achieved in this PR:

I have resolved all the dependencies on RandomState.
The code is still backward compatible in the sense, we can load a pre-trained model that relies on RandomState instead of the Generator and still be able to run it.

Note: I had to change few test files as well, as some tests were relying on hardcoded variables, and since we are using a new Random Generator, they tended to fail. So I have updated these tests. However, I'm not an expert in all the models implemented in this repo, so a review is required for any of these changes.

Thanks.

Aug 30 '21 21:08 SagarDollin

gensim gensim copied to clipboard

random.RandomState with different versions of numpy has vastly different performance

gensim
gensim copied to clipboard