pytest icon indicating copy to clipboard operation
pytest copied to clipboard

Add hash comparison for pyc cache files

Open cdce8p opened this issue 2 years ago • 3 comments

Compare pyc cache files by source hash.

Especially in CI environments, it's common for mtimes to be reset when restoring a cache. To work around that, fallback to a source hash comparison if both size and mtime didn't match.

Related:

  • https://docs.python.org/3/reference/import.html#pyc-invalidation
  • https://docs.python.org/3/library/importlib.html#importlib.util.source_hash
  • https://github.com/psf/black/pull/3821

cdce8p avatar Sep 08 '23 23:09 cdce8p

Hi folks,

Late to the discussion, but my gut feeling is that we should avoid introducing a new option, and use the optimal method behind the scenes always. I like the approach of using size +mtime as shortcut and fallback to hashing if they still match.

I understand this is not standard according to the PEP-552 (I haven't read it, just gleaming it from the discussion here), but deviating from it will somehow turn the .pyc files we generate unreadable by external tools? What are the consequences of that?

We should also benchmark this carefully... generating hashes is fast, but I suspect doing it for every file in a large repository will be noticeable more costly than what we do currently.

nicoddemus avatar Sep 13 '23 13:09 nicoddemus

I understand this is not standard according to the PEP-552 (I haven't read it, just gleaming it from the discussion here), but deviating from it will somehow turn the .pyc files we generate unreadable by external tools? What are the consequences of that?

The pyc header doesn't allow both hash and timestamp, so a pyc files needs to be one or the other.

We should also benchmark this carefully... generating hashes is fast, but I suspect doing it for every file in a large repository will be noticeable more costly than what we do currently.

I ran a benchmark here, my conclusion is that for the initial run, hashing is unlikely to be noticeable even for large repositories, compared to actually reading and parsing the files. The overhead probably comes in on a cache hit -- with a timestamp you only need to do a stat of the py file, with hashing you need to read it again. If running on huge test files and a slow disk, it might be noticeable.

bluetech avatar Sep 13 '23 19:09 bluetech

Hi guys, I'd like to continue with this one. Since the time I left it, the testing structure in Home Assistant was changed which allows for a better performance comparison now. Just a little preview, using this PR with --invalidation-mode=checked-hash yields close to a 50% improvement in our test collection times with Github actions. Of course that's highly project depended, but an improvement is noticeable.

For Home Assistant we now run a --collect-only job before the actual tests. With that we create custom test buckets for the test matrix jobs. Here are some numbers

Before After
Collection 05:38min 03:48min +48%
Tests 45:05min 41:27min +8.7%
Total 50:43min 45:15min +12%

Home Assistant currently has around 35k tests in 3.3k different files.

I like the approach of using size +mtime as shortcut and fallback to hashing if they still match.

Tbh I don't really have a preference here as long as there is a way to tell pytest to also use the hash. If you like to keep the pyc file format, then the current state with --invalidation-mode is probably the way to go. Maybe we could add a fallback to SOURCE_DATE_EPOCH like cpython has as well. Otherwise we would essentially create a new format. Not sure if that's desirable.

I think we can just use hash mode always and it will be fine.

Probably. On any modern machine this shouldn't really be an issue. Although I'm not sure about edge cases. The size+mtime comparison works well (just not in CI environments), so I don't really see a need to change that. If at all, I would probably just go with the custom file format and then use the hash as fallback like originally suggested. This would at least simplify the logic a bit.

--

One followup would be an independent way to control if pyc files should be written. I'd like to use PYTHONDONTWRITEBYTECODE: 1 to be able to only cache the pytest pyc files.

One might think that caching the regular bytecode files would be an improvement as well. However, the benefit was only minor (+2%) while it required manually compiling all files with py_compile to be able to also use the hash comparison for those. I think it's likely that the C loop to create the bytecode files is simply much faster than the pytest code for assertion rewriting.

cdce8p avatar Apr 27 '24 22:04 cdce8p