CI test runs can fail because of an outdated data cache when making changes in the data directory
Version Checks (indicate both or one)
- [x] I have confirmed this bug exists on the latest release of PyPSA-Eur.
- [x] I have confirmed this bug exists on the current master branch of PyPSA-Eur.
Issue Description
CI test runs can fail unexpectedly when someone opens a PR that changes a file in the data directory. The reason is that the data directory is cached on a weekly schedule by default, and the cache is not refreshed when changes are made to the data directory.
Although this is a niche issue that only occurs within the window between two weekly cache refreshes, it is probably worth fixing the CI caching action so that CI test runs keep passing.
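For context, a minimal sketch of how the weekly caching presumably works (the step names and the week-setting command are assumptions; only the cache key `data-cutouts-${{ env.WEEK }}` is taken from the snippet further down): the key rotates only once per calendar week, so within a week the stale cached copy of data/ keeps getting restored regardless of what a PR changes.

```yaml
# Sketch only: step names and the week-setting command are assumptions.
- name: Set weekly cache key component
  run: echo "WEEK=$(date +'%Y-W%W')" >> "$GITHUB_ENV"

- uses: actions/cache@v4
  with:
    path: |
      data
      cutouts
    # The key only changes once per week, so a PR that edits files in data/
    # still restores the stale cached copy, and the committed changes are
    # not seen by the tests.
    key: data-cutouts-${{ env.WEEK }}
```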
Reproducible Example
See this PR in a PyPSA-Eur fork: https://github.com/open-energy-transition/open-tyndp/actions/runs/19131935060/job/54674570111#step:12:5481
There, newly introduced columns in a file in the data directory are not recognised by the CI, even though the changes are included in the PR.
Expected Behavior
One would expect the CI to take changes to the data directory into account.
Installed Versions
Maybe @lkstrp has a good idea if/how this should be addressed. I will have some time next week to test out some solutions.
We can simply invalidate the cache based on changes in the data dir. Maybe we include the hash of versions.csv in the cache key? With the data layer you would always need to touch it when adding new data.
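With the existing actions/cache step, one way to do that is GitHub Actions' hashFiles() expression. A minimal sketch, assuming the current path list and key prefix:

```yaml
# Sketch: mixes the hash of data/versions.csv into the existing weekly key.
- uses: actions/cache@v4
  with:
    path: |
      data
      cutouts
    # hashFiles() changes whenever data/versions.csv changes, so a PR that
    # touches it gets a fresh cache instead of the stale weekly one.
    key: data-cutouts-${{ env.WEEK }}-${{ hashFiles('data/versions.csv') }}
```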
I think rather than invalidating the cache, it makes sense to exclude everything that is tracked in git from caching. For the new data layer that is especially important for versions.csv; then we don't need to invalidate the cache at all, even though it would still make sense to update it.
I.e., we could move everything that is checked into the repo to data/git, for instance, and then change the cache to:

```yaml
- uses: actions/cache@v4
  with:
    path: |
      data
      !data/versions.csv
      !data/git
      cutouts
    key: data-cutouts-${{ env.WEEK }}
```

I think.
I'm against moving everything from the repo into a data/git folder - that change would be technically motivated and would obfuscate which data is where even further (I'd prefer to have folders like data/custom for the overwrites instead).
I think invalidating the cache is a clean solution: we create a hash from the checked-out version of the data/ folder (that would also work with data/versions.csv) and use that as the cache key before the cache is restored. If the key matches, the cache is restored; otherwise it is regenerated.
It is the same mechanism that we already use for updating the cache weekly, just incorporating one more piece of information.
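A minimal sketch of that, assuming the current path list and weekly key layout, using hashFiles() over data/** as the extra key component:

```yaml
# Sketch: key incorporates a hash of the checked-out contents of data/.
- uses: actions/cache@v4
  with:
    path: |
      data
      cutouts
    # hashFiles('data/**') is evaluated before the cache is restored, i.e.
    # over the checked-out files only, so any change committed under data/
    # (including data/versions.csv) produces a new key.
    key: data-cutouts-${{ env.WEEK }}-${{ hashFiles('data/**') }}
    # Optional: fall back to this week's cache if there is no exact match.
    restore-keys: |
      data-cutouts-${{ env.WEEK }}-
```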