CI test runs can fail because of outdated data cache when making changes in data directory

Open daniel-rdt opened this issue 1 month ago • 5 comments

Version Checks (indicate both or one)

  • [x] I have confirmed this bug exists on the latest release of PyPSA-Eur.

  • [x] I have confirmed this bug exists on the current master branch of PyPSA-Eur.

Issue Description

CI test runs can fail unexpectedly when someone creates a PR that changes a file in the data directory. The reason is that the data directory is cached on a weekly schedule by default, and the cache is not refreshed when a PR changes files in the data directory, so CI can run against an outdated cached copy of the changed file.

Although this is a niche issue that only occurs between two weekly cache refreshes, it is probably worth fixing the CI caching action so that CI test runs keep passing.

Reproducible Example

See this PR in a PyPSA-Eur fork: https://github.com/open-energy-transition/open-tyndp/actions/runs/19131935060/job/54674570111#step:12:5481

There, newly introduced columns in a file in the data directory are not recognised by the CI, even though the changes are included in the PR.

Expected Behavior

One would expect the CI to take changes to the data directory into account.

Installed Versions

Replace this line.

daniel-rdt, Nov 07 '25 10:11

Maybe @lkstrp has a good idea of whether and how this should be addressed. I will have some time next week to test out some solutions.

daniel-rdt, Nov 07 '25 10:11

We can simply invalidate the cache based on changes in the data dir. Maybe we include the hash of versions.csv in the cache key? With the data layer you would always need to touch it when adding new data
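
A minimal sketch of that idea, assuming the cache step uses a weekly key of the form `data-cutouts-${{ env.WEEK }}` and caches the `data` and `cutouts` directories (both are assumptions about the workflow, and the step name is hypothetical). `hashFiles` is the built-in GitHub Actions expression function; everything else is illustrative:

```yaml
# Sketch only: assumes env.WEEK is set by an earlier step in the workflow.
- name: Cache data and cutouts  # hypothetical step name
  uses: actions/cache@v4
  with:
    path: |
      data
      cutouts
    # Any change to data/versions.csv produces a different key,
    # which results in a cache miss and a freshly built cache.
    key: data-cutouts-${{ env.WEEK }}-${{ hashFiles('data/versions.csv') }}
```

With a key like this, the weekly rotation stays in place and an edit to `data/versions.csv` in a PR would additionally force a fresh cache for that PR.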

lkstrp, Nov 07 '25 17:11

> We can simply invalidate the cache based on changes in the data dir. Maybe we include the hash of versions.csv in the cache key? With the data layer you would always need to touch it when adding new data

I think that, rather than invalidating the cache, it makes sense to exclude everything that is tracked in git from caching. With the new data layer that is especially important for versions.csv. Then we don't need to invalidate the cache, even though it would still make sense to update it.

coroa, Nov 07 '25 17:11

I.e., we could move everything that is checked into the repo to data/git, for instance, and then change the cache to:

    - uses: actions/cache@v4
      with:
        path: |
          data
          !data/versions.csv
          !data/git
          cutouts
        key: data-cutouts-${{ env.WEEK }}

I think.

coroa, Nov 07 '25 17:11

> I.e., we could move everything that is checked into the repo to data/git, for instance, and then change the cache to:
>
>     - uses: actions/cache@v4
>       with:
>         path: |
>           data
>           !data/versions.csv
>           !data/git
>           cutouts
>         key: data-cutouts-${{ env.WEEK }}
>
> I think.

I'm against moving everything from the repo into a data/git folder. That change would be technically motivated and would obfuscate which data is where even further (I'd prefer to have folders like data/custom for the overwrites instead).

I think invalidating the cache is a clean solution: we create a hash from the checked-out version of the data/ folder (that would also work with data/versions.csv) and use it as part of the cache key before the cache is restored. If the key matches, the cache is restored; otherwise it is regenerated. It is the same mechanism that we use for updating the cache weekly, just incorporating one more piece of information.
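
Roughly, this could look like the sketch below. It assumes the cache step runs directly after checkout and before any step that downloads additional files into data/, so that only the checked-out files are hashed. The step order and the `env.WEEK` variable are assumptions about the workflow; `hashFiles` is the built-in GitHub Actions expression function.

```yaml
# Sketch only: the step order matters for what gets hashed.
- uses: actions/checkout@v4

# At this point data/ contains only the files tracked in git,
# so the hash reflects the checked-out version of the folder.
- uses: actions/cache@v4
  with:
    path: |
      data
      cutouts
    # The WEEK part keeps the weekly refresh; the hash part changes
    # whenever a tracked file under data/ is added or modified.
    key: data-cutouts-${{ env.WEEK }}-${{ hashFiles('data/**') }}
```

Since the key only matches when the tracked files under data/ are identical, a restored cache should not overwrite files changed in the PR with older copies.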

euronion, Nov 10 '25 08:11