gensim
gensim copied to clipboard
random.RandomState with different versions of numpy has vastly different performance
the performance of random.RandomState in word2vec.py (version 3.8.0)
def seeded_vector(self, seed_string, vector_size):
once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
return (once.rand(vector_size) - 0.5) / vector_size
seemingly depends greatly on the version of numpy installed. With numpy = 1.14.3, the following code
from numpy.random import RandomState as Ran
from time import time
t1 = time()
for i in range(100000):
temp = Ran(hash((i)) & 0xffffffff)
t2 = time()
t2-t1
produced
0.28105926513671875
exactly the same code with numpy= 1.18.1 produced
18.590345859527588
I noticed this because I was training a model with millions of words as vocabulary, and after updating numpy unwittingly (via a anaconda update), I noticed that the time for build_vocab was significantly longer, and after some debugging, I nailed it down to random.RandomState in the seeded_vector function.
I know this is indeed a numpy issue, but even they mentioned it that RandomState is legacy (https://docs.scipy.org/doc/numpy/reference/random/performance.html). Therefore I wonder if you have some plans to upgrade randomstate? Thanks!
Thanks for reporting. What was the difference in random.RandomState that you nailed it down to?
We should replace RandomState if it's deprecated. What should it be replaced with, is the change 100% backward compatible? Can you open a PR?
what I found is that numpy.random.RandomState defaults to the legacy random generator, i.e., the slowest, however, there are other options: PCG64, SFC64, etc. You could simply use them (after import numpy.random as random) with, e.g., random.SFC64, instead of random.RandomState.Same code as above but using SFC64, even with numpy= 1.18.1, would give
2.767242193222046
which is the fastest of the bunch, although still not as fast as in numpy=1.14.3
I don't know exactly what caused the performance change inside numpy, and I also don't know how you can easily amend the code without breaking compatibility.
I don't think it's necessary (or reasonable for users to expect) that the same seeding initializes vectors identically across changed gensim versions. (The starting state shouldn't be particularly important for end results under realistic uses – such as multithreaded training, which introduces lots more uncontrollable randomness.)
So we should just choose whatever numpy recommends as a fast pseudorandom generator.
Agreed. A 60x slowdown is really worrying.
Ultimately any other uses of RandomState in gensim – https://github.com/RaRe-Technologies/gensim/search?q=RandomState&type=Code – will face the same slowdown unless updated to match numpy best-practices.
@zygm0nt do you think you could take care of all the places gensim uses the old (slow) numpy RNG?
Sure, I can take care of that. Should this be done in this ticket? by reopening it?
Yeah, let me reopen. Thanks!
Hi, @zygm0nt @piskvorky , Any updates on this issue? I would like to contribute to a PR if it is already created. Also, please help me with the best practices I must follow during a PR. Are there any unit tests to pass before creating a PR. Or do I need to check the integrity of the code locally (in which case point me to any documentation that could help)
Thanks, looking forward to your reply.
Sure! The code style and instructions for Gensim are here: https://github.com/RaRe-Technologies/gensim/wiki/Developer-page Thanks.
Hi @piskvorky @gojomo,
I'm here with some updates about the work I did today.
From the above comments, it is clear that we need to replace all instances of RandomState and RandomState objects as it is deprecated from new versions of NumPy and also because it is 60 times slower.
What can we use to replace RandomState?
Ans: Instead of using np.random.RandomState(seed) we can use np.random.default_rng(seed) . default_rng instantiates a Generator with numpy's default BitGenerator(PCG64). For more info refer this documentation.
Will making the above change introduce new bugs?
Ans: The above change could potentially introduce new bugs if we do not take care of lines of code dependent on RandomState because we are using a Generator which is a replacement for RandomState , but they are not the exact same. To avoid any bugs, I have to do modifications wherever we have a dependency on RandomState.
To deal with any dependencies, I did some digging into the code, and I found few instances that have a dependency on RandomState, which will not cause any new bug given we do few modifications. Refer below for all the details(you can skip this part):
-
In the file gensim/utils the function
get_random_state(seed)has the linereturn np.random.mtrand._randon line 90, but the returned objectnp.random.mtrand._randis an instance ofRandomState(refer to the screenshot below)
The simple solution to this dependency is to replace np.random.mtrand._randwith an instance of Generator i.e.,np.random.deafult_rng().
-
There are other dependencies like in the file gensim/models/poincare.py there are instances of like these:
self._np_random.uniform(..)self._np_random.randint(..)self._np_random.choice(..)self._np_random.shuffle(..)whereself._np_random = np_random.RandomState(seed). Now since we are replacingRandomState, theself._np_randomwill be aGenerator, and we have the implementation ofuniform(..),choice(..),shuffle(..)with the same parameters, exceptrandintin theGenerator.I need to replace all occurrences ofrandint(..)withintegers(..)that basically take the same parameters asrandintin our case and do the same job.
You can see more info about these methods here in the documentation.
Note: Here, in any line of code I have mentioned with np, I'm referring to the numpy package.
To Conclude Let me know if anyone has any other ideas or concerns with the above method soon, as I will start working on the PR immediately.
Sounds good to me, thanks.
After the replacement, can you do a sanity check re. performance? The new code should be faster (or at least not slower) than the existing code. I don't expect the overall impact will be too large (RNG is just a small part of the ML algos), but it would still be nice to include some concrete numbers in the release notes.
Hey @piskvorky, Thanks for your comment. Sure I will include a sanity test. I needed some clarifications for the sanity testing.
- The Poincare model in the gensim/models/poincare.py, which seems to be heavily relying on
RandomStatemethods. So I will be testing this model after the change. However I might need a big dataset in order to test this. The dataset poincare_hypernyms_large might be a little small for that. Let me know if you can point me to any other bigger dataset, or should I go forward with the current dataset? - Apart from Poincare there is word2vec that has
RandomStatebut it is not using any of the methods of theRandomState. So I'm guessing testing this would give us the same results after making changes. - Should I include the sanity test file in the PR? Or is a screenshot of the results enough?
Thanks, Sagar B Dollin
Should I include the sanity test file in the PR? Or is a screenshot of the results enough?
A summary of your benchmark as part of the PR description is enough. Nothing fancy or formal – really, a rudimentary sanity check. Thanks.
Hi @piskvorky, The PR is ready for review. Things achieved in this PR:
- I have resolved all the dependencies on
RandomState. - The code is still backward compatible in the sense, we can load a pre-trained model that relies on
RandomStateinstead of theGeneratorand still be able to run it.
Note: I had to change few test files as well, as some tests were relying on hardcoded variables, and since we are using a new Random Generator, they tended to fail. So I have updated these tests. However, I'm not an expert in all the models implemented in this repo, so a review is required for any of these changes.
Thanks.