Cache miss occurs due to use of hash() in key

Open mattearllongshot opened this issue 3 years ago • 3 comments

Description of your problem or feature request

In certain circumstances the compilation cache misses when it should hit, which leads to extra storage and CPU usage.

To reproduce, save the following script as bug.py:

import aesara
import aesara.tensor as at

params = at.vector()
probabilities = params + 1
aesara.function([params], [probabilities])

Run the following commands:

$ rm -r ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/
$ python bug.py
$ ls ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp*
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpmwdm85mq:
__init__.py  key.pkl  mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so  mod.cpp  __pycache__

/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpyzxlvysm:
__init__.py  key.pkl  m5d69910cf9a4555423d5e68cc8357d194094f92b3878920c98b2a24e728523c5.so  mod.cpp  __pycache__
$ python bug.py
$ ls ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp*
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpmwdm85mq:
__init__.py  key.pkl  mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so  mod.cpp  __pycache__

/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpoi084cuv:
__init__.py  key.pkl  mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so  mod.cpp  __pycache__

/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpyzxlvysm:
__init__.py  key.pkl  m5d69910cf9a4555423d5e68cc8357d194094f92b3878920c98b2a24e728523c5.so  mod.cpp  __pycache__

After the first run there are two cache directories, but after the second there are three, two of which (tmpmwdm85mq and tmpoi084cuv) contain an identical module. Their key.pkl files differ, but their mod.cpp files are the same. Each subsequent run adds another duplicate directory.
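The duplicates can also be found without comparing listings by eye. The following is a minimal sketch, not part of Aesara: `find_duplicate_modules` is a hypothetical helper that assumes the layout shown in the listings above (one `mod.cpp` per `tmp*` subdirectory) and groups the subdirectories by the SHA-256 digest of that file.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_modules(compiledir):
    """Group cache subdirectories by the SHA-256 digest of their mod.cpp.

    Hypothetical helper: assumes each tmp* subdirectory holds one mod.cpp,
    as in the directory listings above.
    """
    groups = defaultdict(list)
    for mod_cpp in sorted(Path(compiledir).glob("tmp*/mod.cpp")):
        digest = hashlib.sha256(mod_cpp.read_bytes()).hexdigest()
        groups[digest].append(mod_cpp.parent.name)
    # Keep only sources that appear in more than one directory.
    return {digest: names for digest, names in groups.items() if len(names) > 1}
```

Calling it with the path of the compile directory returns a mapping from source digest to the directory names that share that source. An empty result means no duplicated modules were found.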

Running with PYTHONHASHSEED=1234 stops the duplicates being created:

$ rm -r ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/
$ PYTHONHASHSEED=1234 python bug.py
$ ls ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp*
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp7bj_c23_:
__init__.py  key.pkl  m5d69910cf9a4555423d5e68cc8357d194094f92b3878920c98b2a24e728523c5.so  mod.cpp  __pycache__

/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpmrtrbg48:
__init__.py  key.pkl  mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so  mod.cpp  __pycache__
$ PYTHONHASHSEED=1234 python bug.py
$ ls ~/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp*
/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmp7bj_c23_:
__init__.py  key.pkl  m5d69910cf9a4555423d5e68cc8357d194094f92b3878920c98b2a24e728523c5.so  mod.cpp  __pycache__

/home/matthew/.aesara/compiledir_Linux-4.15--generic-x86_64-with-glibc2.10-x86_64-3.8.13-64/tmpmrtrbg48:
__init__.py  key.pkl  mba10987274f369529454d4a60996746c3927712a0b1b3928446a3c275151e2ee.so  mod.cpp  __pycache__

I believe this is because parts of the key are computed with Python's built-in hash(), which by default is seeded differently in each new process. ExternalCOp.c_code_cache_version is one example.
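The per-process seeding can be observed in isolation, without Aesara. The sketch below starts fresh interpreters and asks each for the `hash()` of the same string. `hash_in_new_process` is a hypothetical helper written for this illustration, and the string `"aesara"` is an arbitrary example value. Hash randomization affects `str` and `bytes` values, so any cache key derived from them through `hash()` changes between runs unless PYTHONHASHSEED is fixed.

```python
import os
import subprocess
import sys


def hash_in_new_process(text, seed=None):
    """Return hash(text) as computed by a freshly started interpreter."""
    env = dict(os.environ)
    # Drop any inherited seed so the default (randomized) behavior applies.
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Default settings: each process picks its own seed, so values differ.
random_hashes = {hash_in_new_process("aesara") for _ in range(3)}
# Fixed seed: every process computes the same value.
seeded_hashes = {hash_in_new_process("aesara", seed=1234) for _ in range(3)}
print(len(random_hashes), len(seeded_hashes))
```

With the default settings the three processes are expected to report different values, while the seeded runs agree, which matches the behavior seen with and without PYTHONHASHSEED=1234 above.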

Versions and main components

  • Aesara version: 2.7.9
  • Aesara config (python -c "import aesara; print(aesara.config)"): aesara_config.txt
  • Python version: Python 3.8.13
  • Operating system: Ubuntu 18.04.6 LTS
  • How did you install Aesara: conda

mattearllongshot, Aug 05 '22 15:08

Thank you for opening an issue. Do you still observe this behavior after setting the environment variable PYTHONHASHSEED to a fixed value, e.g. 0? See the Python documentation.

rlouf, Aug 09 '22 16:08

Hello, setting PYTHONHASHSEED does indeed fix this, although it's not ideal, since we want to keep the benefits of randomized hashing (e.g. avoiding dictionary keying attacks).
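One way to keep randomized hashing while still getting reproducible cache keys is to derive the key from a cryptographic digest instead of `hash()`. The following is a minimal sketch of that idea under stated assumptions, not Aesara's actual fix: `stable_digest` is a hypothetical helper, and it assumes every key part has a deterministic `repr` (strings, integers, and tuples of those; not sets, and not objects whose default `repr` includes a memory address).

```python
import hashlib


def stable_digest(*parts):
    """Return a hex digest of `parts` that is the same in every process.

    Hypothetical helper: assumes each part has a deterministic repr().
    """
    h = hashlib.sha256()
    for part in parts:
        data = repr(part).encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()
```

Because `hashlib` digests do not depend on PYTHONHASHSEED, a key built this way stays the same across runs even with hash randomization left on.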

mattearllongshot, Aug 09 '22 19:08

Thank you for reporting the result. This was just to make sure that hash() was the sole culprit here.

rlouf, Aug 09 '22 19:08