Cache miss occurs due to use of hash() in key
Description of your problem or feature request
In certain circumstances we get spurious cache misses, leading to extra storage and CPU usage.
To reproduce, save the following script as bug.py:
import aesara
import aesara.tensor as at
params = at.vector()
probabilities = params + 1
aesara.function([params], [probabilities])
Run the following commands:
$ rm -r ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/
$ python bug.py
$ ls ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp*
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpmwdm85mq:
__init__.py key.pkl mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so mod.cpp __pycache__
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpyzxlvysm:
__init__.py key.pkl m5d69910cf9a4555423d5e68cc8357d194094f92b3878920c98b2a24e728523c5.so mod.cpp __pycache__
$ python bug.py
$ ls ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp*
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpmwdm85mq:
__init__.py key.pkl mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so mod.cpp __pycache__
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpoi084cuv:
__init__.py key.pkl mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so mod.cpp __pycache__
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpyzxlvysm:
__init__.py key.pkl m5d69910cf9a4555423d5e68cc8357d194094f92b3878920c98b2a24e728523c5.so mod.cpp __pycache__
After the first run there are two cache directories, but after the second there are three: tmpoi084cuv holds a module identical to the one in tmpmwdm85mq. The key.pkl files differ, but mod.cpp is the same. Subsequent runs add further duplicate directories.
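To check for duplicates without comparing files by hand, the cache directories can be grouped by the content of their mod.cpp. This is only a rough sketch: `find_duplicate_modules` is a hypothetical helper, not part of Aesara, and the demo uses a throwaway directory with placeholder sources rather than a real compiledir.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_duplicate_modules(compiledir):
    """Group tmp* cache directories by the SHA-256 of their mod.cpp contents.

    Hypothetical helper for illustration; it only reads files.
    """
    groups = defaultdict(list)
    for mod in sorted(Path(compiledir).glob("tmp*/mod.cpp")):
        digest = hashlib.sha256(mod.read_bytes()).hexdigest()
        groups[digest].append(mod.parent.name)
    # Keep only sources that appear in more than one cache directory.
    return {d: names for d, names in groups.items() if len(names) > 1}


# Demo on a temporary directory that mimics the layout in the listing above.
# The file contents are placeholders, not real generated C++.
with tempfile.TemporaryDirectory() as root:
    layout = [
        ("tmpmwdm85mq", "// module A"),
        ("tmpoi084cuv", "// module A"),
        ("tmpyzxlvysm", "// module B"),
    ]
    for name, source in layout:
        cache_dir = Path(root) / name
        cache_dir.mkdir()
        (cache_dir / "mod.cpp").write_text(source)
    duplicates = find_duplicate_modules(root)

print(duplicates)
```

Pointed at the real compiledir, any directory names it reports together would be candidates for the duplicates described above.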
Running with PYTHONHASHSEED=1234 stops the duplicates being created:
$ rm -r ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/
$ PYTHONHASHSEED=1234 python bug.py
$ ls ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp*
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp7bj_c23_:
__init__.py key.pkl m5d69910cf9a4555423d5e68cc8357d194094f92b3878920c98b2a24e728523c5.so mod.cpp __pycache__
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpmrtrbg48:
__init__.py key.pkl mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so mod.cpp __pycache__
$ PYTHONHASHSEED=1234 python bug.py
$ ls ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp*
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp7bj_c23_:
__init__.py key.pkl m5d69910cf9a4555423d5e68cc8357d194094f92b3878920c98b2a24e728523c5.so mod.cpp __pycache__
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpmrtrbg48:
__init__.py key.pkl mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so mod.cpp __pycache__
I believe this is because parts of the key are computed using Python's built-in hash(), which by default is seeded differently for each new process. ExternalCOp.c_code_cache_version is one example.
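The effect can be seen in isolation with a short script. This is a minimal sketch and does not reproduce Aesara's actual key computation: `KEY_PART` is a hypothetical stand-in for a string that ends up in a cache key, and explicit seeds are used so the comparison is repeatable. It compares the built-in hash() with a hashlib digest across fresh interpreter processes.

```python
import hashlib
import os
import subprocess
import sys

# Hypothetical stand-in for a string that contributes to a cache key.
KEY_PART = "c_code_cache_version"


def hash_in_subprocess(seed):
    """Return (hash(KEY_PART), sha256 hex digest) computed in a fresh interpreter."""
    code = (
        "import hashlib;"
        f"s = {KEY_PART!r};"
        "print(hash(s));"
        "print(hashlib.sha256(s.encode()).hexdigest())"
    )
    # PYTHONHASHSEED controls the salt used for str and bytes hashes.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    builtin_hash, digest = result.stdout.split()
    return int(builtin_hash), digest


builtin_a, digest_a = hash_in_subprocess(1)
builtin_b, digest_b = hash_in_subprocess(2)
builtin_a_again, _ = hash_in_subprocess(1)

print("hash() differs across seeds:", builtin_a != builtin_b)
print("hash() repeats for the same seed:", builtin_a == builtin_a_again)
print("sha256 is seed-independent:", digest_a == digest_b)
```

A key built from a content digest such as hashlib.sha256 would not depend on the hash seed, so randomized hashing could stay enabled. Whether that fits the places where Aesara builds its keys is an open question here.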
Versions and main components
- Aesara version: 2.7.9
- Aesara config (python -c "import aesara; print(aesara.config)"): aesara_config.txt
- Python version: Python 3.8.13
- Operating system: Ubuntu 18.04.6 LTS
- How did you install Aesara: conda
Thank you for opening an issue. Do you still observe this behavior after setting the environment variable PYTHONHASHSEED to e.g. 0? See the Python documentation.
Hello, setting PYTHONHASHSEED does indeed fix this, although it's not ideal since we want to keep the benefits of randomized hashing (e.g. avoiding dictionary keying attacks).
Thank you for reporting the result. This was just to make sure that hash was the sole culprit here.