
data: Intermediate layer for versioning of datasets

Open euronion opened this issue 7 months ago • 19 comments

This PR suggests an implementation of an intermediate data layer that allows retrieving a specific version of a dataset rather than always the latest.

The current status is implemented for the OSM dataset, as this one was already tracking versions and working very similarly. I'll extend it to two more datasets in the upcoming days.

Motivation

Upstream data dependencies are not always versioned or fixed, meaning they may change unexpectedly without a way to revert to a different version. This causes reproducibility issues.

Approach

This PR solves the problem in the following way:

  • We create archived versions of all external datasets (if we are allowed -> question of licensing) on e.g. Zenodo
  • The URL for retrieving each combination of (dataset x version) is stored in data/versions.csv. This allows us to switch to a different data platform or provider if necessary, or to use a versioned URL directly from a data provider if available.
  • data/versions.csv also records the license and the description of the dataset. I plan on utilising this information to automatically create new versions of datasets and distribute the license text + metadata information alongside them, as well as to utilise this file for the documentation here
  • I imagine that all externally retrieved data will get two rules: (1) a rule for retrieving the upstream version, which I called source: "build" in the config.default.yaml, and (2) a rule to retrieve an archived version of the data. Both rules yield the same files that are then consumed by the model (see the sketch after this list).
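
For illustration, a minimal sketch of what such a pair of rules could look like for a hypothetical example_dataset (rule names, paths and the Zenodo record are placeholders, not the rules implemented in this PR):

# Sketch only: the config decides whether the dataset is built from upstream
# or retrieved from the archive; both variants produce the same output file.
if config["datasets"]["example_dataset"]["source"] == "build":

    rule build_example_dataset:
        output:
            "data/example_dataset/example.csv",
        script:
            "../scripts/build_example_dataset.py"

else:

    rule retrieve_example_dataset:
        input:
            storage("https://zenodo.org/records/<record-id>/files/example.csv"),
        output:
            "data/example_dataset/example.csv",
        run:
            import shutil
            shutil.copy(input[0], output[0])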

TODO

  • [ ] Implement for two more datasets
  • [ ] Helper and instructions for creating a new version of a dataset from upstream to archive (small script/CLI tool)
  • [ ] Unbundle the data bundle on Zenodo
  • [ ] Move datasets from data/ to Zenodo, keep files in data that are manual specifications/inputs, e.g. as data/manual
  • [ ] Update documentation to utilise data/versions.csv

Comments are already welcome!

Checklist

  • [ ] I tested my contribution locally and it works as intended.
  • [ ] Code and workflow changes are sufficiently documented.
  • [ ] Changed dependencies are added to envs/environment.yaml.
  • [ ] Changes in configuration options are added in config/config.default.yaml.
  • [ ] Changes in configuration options are documented in doc/configtables/*.csv.
  • [ ] Sources of newly added data are documented in doc/data_sources.rst.
  • [ ] A release note doc/release_notes.rst is added.

euronion avatar May 07 '25 11:05 euronion

Now implemented for the Worldbank Urban Population dataset.

This dataset uses the same method for retrieving from WB as for retrieving from Zenodo (sandbox link for now), and its structure also lends itself to providing the upstream information in the data/versions.csv file.

euronion avatar May 07 '25 18:05 euronion

@euronion Interesting idea, i think some of that aligns neatly with the data catalogue. I'll follow you developing that and maybe propose a slight variation today or tomorrow. The data/versions.csv you are mentioning is not part of the branch yet. Can you please add it!

coroa avatar May 08 '25 07:05 coroa

Thanks @coroa , you can now have a closer look! Let's also have another chat on how it overlaps with your activities.

I've updated the code; data/versions.csv is now included. It also covers a third dataset - the GEM GSPT (Global Steel Plant Tracker by GEM), which I have chosen as an example because:

  • It shows how the structure allows us to accommodate external data versioning activities, i.e. GEM provides dedicated links to different versions of a dataset, so we don't necessarily need to create a Zenodo mirror (for other reasons we still should for this dataset)
  • It shows how we can track unsupported versions of a dataset in data/versions.csv, i.e. I have added the newest version of the GSPT as "upstream" and "not supported", because in the new version the file format of the data changed and is no longer compatible with the current workflow. This can also be used to mark datasets that are no longer compatible as "deprecated"

Finally,

  • I've opted to rename the output file of the GSPT, such that the version is only encoded in the folder, not in the file name (for easier switching to new versions) here https://github.com/PyPSA/pypsa-eur/blob/27534f256c2c206863e0dd82cbcc99512fdaa660/rules/retrieve.smk#L429

  • And to showcase how we can avoid clutter/bugs with the ever more complicated dependencies between rules, I've used the rules.<rule_name>.output["<output_name>"] reference for the GEM GSPT instead of specifying the file name explicitly (see the sketch below). We could use this approach instead of manually specifying the paths + versions each time: https://github.com/PyPSA/pypsa-eur/blob/27534f256c2c206863e0dd82cbcc99512fdaa660/rules/build_sector.smk#L696
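
For example, the referencing pattern looks roughly like this (rule and output names are made up for illustration):

rule build_steel_plants:
    input:
        # refer to the retrieve rule's named output instead of repeating the
        # versioned path; if the dataset version changes, only the retrieve
        # rule needs to be touched
        gem_gspt=rules.retrieve_gem_gspt.output["gspt"],
    output:
        "resources/steel_plants.csv",
    script:
        "../scripts/build_steel_plants.py"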

euronion avatar May 08 '25 09:05 euronion

@euronion Interesting idea, i think some of that aligns neatly with the data catalogue. I'll follow you developing that and maybe propose a slight variation today or tomorrow. The data/versions.csv you are mentioning is not part of the branch yet. Can you please add it!

Recording an idea: With the structure I propose above, we have a dedicated folder for each data input and version. This would be a good place to store a copy of the LICENSE for that particular dataset as well as a metadata.json.

E.g.

data/worldbank_urban_population
├── 2025-05-07
│   ├── API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2.csv
│   ├── API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2.zip
│   ├── LICENSE
│   ├── Metadata_Country_API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_86733.csv
│   ├── Metadata_Indicator_API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_86733.csv
│   └── metadata.json
│    ...

For any data that we store on Zenodo, we can add them to the Zenodo record. For datasets that we don't put on Zenodo or that come from upstream, we'd need a different solution for getting/storing the metadata and LICENSE.
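
A minimal sketch of how such a metadata.json could be written when archiving a new version (the helper and the field names are an assumption, not a fixed schema; URL and license would come from data/versions.csv):

import json
from pathlib import Path


def write_metadata(folder, dataset, version, url, license_name):
    # hypothetical helper: store the key information from data/versions.csv
    # next to the data files so the archived copy is self-describing
    meta = {
        "dataset": dataset,
        "version": version,
        "retrieved_from": url,
        "license": license_name,
    }
    path = Path(folder) / "metadata.json"
    path.write_text(json.dumps(meta, indent=2))
    return path


write_metadata(
    "data/worldbank_urban_population/2025-05-07",
    "worldbank_urban_population",
    "2025-05-07",
    "<URL from data/versions.csv>",
    "<license from data/versions.csv>",
)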

Noting @lkstrp that this structure should also allow us to easily exchange "Zenodo" for any other type of data repo that allows for direct access to files, including S3 buckets.

euronion avatar May 09 '25 09:05 euronion

Sounds beautiful. Before I take a look, I'll just leave a note/request here: Can we also split the data that is retrieved from upstream from the data stored within the repo, so that we don't mix them up within ./data? Probably by moving all repo data to ./data/repo or ./data/git or similar.

lkstrp avatar May 09 '25 09:05 lkstrp

Sounds beautiful. Before I take a look, I'll just leave a note/request here: Can we also split the data that is retrieved from upstream from the data stored within the repo, so that we don't mix them up within ./data? Probably by moving all repo data to ./data/repo or ./data/git or similar.

We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.

euronion avatar May 09 '25 12:05 euronion

Sounds beautiful. Before I take a look, I'll just leave a note/request here: Can we also split the data that is retrieved from upstream from the data stored within the repo, so that we don't mix them up within ./data? Probably by moving all repo data to ./data/repo or ./data/git or similar.

We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.

Seconding the latter idea. Rather than splitting it, removing it from the repo.

coroa avatar May 14 '25 16:05 coroa

We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.

Seconding the latter idea. Rather than splitting it, removing it from the repo.

Oh yes, if we aim big, let's remove them. But I assume they have a reason to be there in the first place, and we cannot just move them all to Zenodo since the update frequency needs to be much lower there (?). The repo data is updated quite often and moving it to Zenodo would just be a huge pain and not the final solution. Either we come up with some S3 storage in here already, keep it in here for now, or use some temp solution (repo / nextcloud).

lkstrp avatar May 15 '25 08:05 lkstrp

This was not originally about update frequency, it was more about convenience and size constraints. Here are the update counts for all files in the data directory over the last year:

data/agg_p_nom_minmax.csv, 1
data/ammonia_plants.csv, 2
data/attributed_ports.json, 0
data/biomass_transport_costs_supplychain1.csv, 1
data/biomass_transport_costs_supplychain2.csv, 1
data/cement-plants-noneu.csv, 1
data/ch_cantons.csv, 0
data/ch_industrial_production_per_subsector.csv, 1
data/custom_extra_functionality.py, 1
data/custom_powerplants.csv, 0
data/district_heat_share.csv, 2
data/egs_costs.json, 1
data/eia_hydro_annual_capacity.csv, 1
data/eia_hydro_annual_generation.csv, 1
data/entsoegridkit/README.md, 0
data/entsoegridkit/buses.csv, 0
data/entsoegridkit/converters.csv, 0
data/entsoegridkit/generators.csv, 0
data/entsoegridkit/lines.csv, 0
data/entsoegridkit/links.csv, 1
data/entsoegridkit/transformers.csv, 0
data/existing_infrastructure/existing_heating_raw.csv, 1
data/gr-e-11.03.02.01.01-cc.csv, 0
data/heat_load_profile_BDEW.csv, 0
data/hydro_capacities.csv, 0
data/links_p_nom.csv, 1
data/nuclear_p_max_pu.csv, 1
data/parameter_corrections.yaml, 1
data/refineries-noneu.csv, 1
data/retro/comparative_level_investment.csv, 0
data/retro/data_building_stock.csv, 0
data/retro/electricity_taxes_eu.csv, 0
data/retro/floor_area_missing.csv, 0
data/retro/retro_cost_germany.csv, 0
data/retro/u_values_poland.csv, 0
data/retro/window_assumptions.csv, 0
data/switzerland-new_format-all_years.csv, 0
data/transmission_projects/manual/new_links.csv, 2
data/transmission_projects/nep/new_lines.csv, 2
data/transmission_projects/nep/new_links.csv, 3
data/transmission_projects/template/new_lines.csv, 1
data/transmission_projects/template/new_links.csv, 1
data/transmission_projects/template/upgraded_lines.csv, 1
data/transmission_projects/template/upgraded_links.csv, 1
data/transmission_projects/tyndp2020/new_lines.csv, 1
data/transmission_projects/tyndp2020/new_links.csv, 2
data/transmission_projects/tyndp2020/upgraded_lines.csv, 1
data/transmission_projects/tyndp2020/upgraded_links.csv, 1
data/unit_commitment.csv, 0

by

for i in $(git ls-files data); do echo $i, $(git log --oneline --since="1 year ago" ${i} | wc -l); done

coroa avatar May 15 '25 11:05 coroa

And this is too frequent for Zenodo

lkstrp avatar May 15 '25 12:05 lkstrp

And this is too frequent for Zenodo

The data bundle alone received about 10 versions in the same time span. Are you talking about the cumulative amount of updates if you bundle them up together?

coroa avatar May 15 '25 12:05 coroa

And this is too frequent for Zenodo

The data bundle alone received about 10 versions in the same time span. Are you talking about the cumulative amount of updates if you bundle them up together?

5 commits for data/transmission_projects and 15 for the rest (out of which 2 should live in technology data, i guess).

Some of the 15 are deletions, some are within the span of a week. Still too many to handle manually i guess.

coroa avatar May 15 '25 12:05 coroa

I don't think we should be handling any of it manually anyway. I was thinking of writing a small CLI script that helps create new versions on Zenodo.

Not only to make it easier, but also to avoid mistakes slipping in.

euronion avatar May 15 '25 12:05 euronion

Files like parameter_corrections, or NEP plans, deserve to be version-controlled since they are hand-written rather than imported.

So tracking them in a git repository would still be good practice. It maybe does not have to be directly in this repository, but it also does not hurt.

Maybe a sub-directory like data/manual, or a pypsa-eur-data-manual repository, but then this also needs to be maintained and kept version-synced.

coroa avatar May 15 '25 12:05 coroa

Small CLI script sounds good and the numbers also don't sound too high, but I am just against using Zenodo for this. In the long term the data bundle should vanish / not just be a storage dump. So we shouldn't bloat it up now.

We need to reupload the whole directory for any new version on Zenodo. Zenodo cannot just update a single file of a bundle. So, if only one of 20 datasets needs an update, we have to reupload them all. This alone is already an unpleasant misuse. But all 20 of them get a new version tag as well, even if for 19 there is no difference between versions. So the whole purpose of versioning datasets is also gone.

As discussed above, the end goal of a data layer needs to provide a version tag per dataset, with two sources: 'archive' and 'primary', while primary may just support latest/nightly. Zenodo is just not designed for this.

lkstrp avatar May 15 '25 13:05 lkstrp

Small CLI script sounds good

  • added as open TODO

and the numbers also don't sound too high, but I am just against using Zenodo for this. In the long term the data bundle should vanish / not just be a storage dump. So we shouldn't bloat it up now.

Agreed. I wasn't thinking of moving the data from the repo into the data bundle. I was thinking about moving the data from the repo into dedicated Zenodo datasets. One Zenodo URL per standalone dataset. Not what we are doing now with the databundle.

We need to reupload the whole directory for any new version on Zenodo.

Yes, and I don't want to repeat that either if we just want to update parts of the data.

As discussed above, the end goal of a data layer needs to provide a version tag per dataset, with two sources: 'archive' and 'primary', while primary may just support latest/nightly. Zenodo is just not designed for this.

Keeping aside the tags, Zenodo is not built for having a single record contain multiple datasets. What I would be doing is to create a dedicated record per dataset. In that case Zenodo serves our purpose nicely. And since we use the storage(...) provider from Snakemake, we can always just provide a different URL if we want to switch to a storage bucket or another archive - they only need to provide version-specific direct URLs for accessing the datasets.

euronion avatar May 15 '25 13:05 euronion

Ok. If we create a single record for each dataset on Zenodo, I would still argue that this is unnecessary overhead, but if you want to go for it, I'll give up my resistance. As you say, we can easily switch then 👍

lkstrp avatar May 15 '25 14:05 lkstrp

This is lovely @euronion !

I have a couple of thoughts on the general schema: [...]

Thanks for the feedback @lkstrp - what I understand is that you only have concerns about the schema, but no comments or concerns about the implementation. Is that correct?

Naming

About your schema concerns: I wasn't very happy with my suggestions either, so I'm happy to change them.

Indexing in data/sources.csv

I'm fine with indexing through dataset (or dataset_name, source, version); that's the status quo anyway, just with renamed variable names.

Sources

On the source: I indeed intentionally merged your (1) and (3) into "build", given that I don't know of any data source that provides both at the same time, but I see that it is clearer to separate them and accept that most datasets will have (2) and either (1) or (3), but not all of (1), (2) and (3).

Versions

The only benefit I see in having consistent version keys across sources is being able to get rid of Recency. Especially since we don't want to increase the version numbers simultaneously, i.e. one would have datasets that are v1.0.0 and some that are v1.0.1 or v4.0.0.

  • The downside I believe is that it requires more effort to compare our data with the primary source's version names, e.g. if we rename GEM's April-2024-V1 to v1.0.0 we are obfuscating their version number.

I'd rather keep the primary source's version names.

Recency

I introduced this column to help me find the "latest" version of a dataset, since the version is not guaranteed to be sortable or to follow semantic versioning, due to the different methods the primary data providers may use for version naming. Then I realised that it has additional value: marking whether the model is still compatible with a dataset, e.g. flagging "old" or "deprecated/incompatible" versions, and indicating what the currently intended/supported version is. I.e. you can keep "latest" in the config.yaml and get an auto-update of a dataset when you upgrade between PyPSA-Eur versions, without having to check whether a new version of the dataset is available and whether you need to update your config file.

I think it would be nice to keep it, for look-up purposes only and not for indexing of the file, such that instead of specifying the version in the config file, one provides the recency. Happy to rename, just not to "tag" - that does not seem descriptive enough to me. What do you think?

To summarize ...

I'd go with something like this:

data/versions.csv:

dataset   source   version          recency
GEM_GSPT  primary  Febuly-2999-V1   unstable / nightly / untested
GEM_GSPT  primary  April-2024-V1    latest
GEM_GSPT  primary  January-1970-V1  deprecated
GEM_GSPT  primary  January-2000-V1  outdated
GEM_GSPT  archive  April-2024-V1    latest
GEM_GSPT  archive  January-1970-V1  deprecated
GEM_GSPT  archive  January-2000-V1  outdated
...       ...      ...              ...
OSM       build    build            unstable / nightly / untested
OSM       archive  0.7              unstable / nightly / untested
OSM       archive  0.6              latest
OSM       archive  0.1              deprecated
...       ...      ...              ...
WDPA      primary  primary          unstable / nightly / untested / we don't have anything better or an archived version
  • all datasets are downloaded to data/<dataset>/<version>/
  • config.yaml will have
datasets:
  <dataset>:
    source: "primary" | "archive" | "build"
    version: "<a version from versions.csv>" | "" # either version or recency need to be specified
    recency: "" | "latest" | "nightly"                        # either version or recency need to be specified
    

euronion avatar May 15 '25 18:05 euronion

Update after some discussions:

For data/versions.csv we will go with 6 columns:

  • dataset : name of the dataset
  • source: one of primary | build | archive, determining whether the dataset is retrieved from the original data provider (primary), built based on the original data source, e.g. OSM (build), or retrieved as an archived version from our mirror on e.g. Zenodo (archive)
  • version: Name of the version following the versioning schema of the original data provider. If the original data provider does not have a versioning schema, we'll go with a pragmatic version name, e.g. the date YYYY-MM-DD the data was retrieved and the archived version was created.
  • tags: a list of tags that we support. For now, the only one is latest-supported, which refers to the latest version of a dataset that is supported by the model. latest-supported needs to be bumped when creating a new version of a dataset and adding it to the file. Envisioned future tag options are e.g. nightly or latest.
  • supported: A flag, either TRUE or FALSE, indicating whether the current model version supports this dataset version. We'll not actively monitor or test for compatibility; the intention here is to indicate, when a new version of a dataset is added, whether the previous version is just outdated or whether the data schema/contents changed and it is therefore no longer compatible with and supported by the model.
  • URL: URL pointing to the resource for download.

Further:

  • Downloaded data will be located in dedicated subfolders data/<dataset>/<source>/<version>/, allowing for clear separation of any dataset.
  • If the primary or build source allows for downloading continuously updated data without a versioning schema, e.g. OSM, then the version to use by convention is 'unknown'.
  • In the config file, we specify the data using source and version for each dataset. version is a valid version from the .csv, with the special version name latest-supported that gets resolved to the version of the dataset carrying this particular tag (see the sketch below). This version should be the default for most users, as this way they always get the newest data that is compatible with the model after upgrades, without losing previous datasets should they desire to switch back or compare.
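
To make the intended resolution concrete, here is a minimal sketch (assuming pandas and the column names listed above; not code from this PR) of how a config entry could be resolved against data/versions.csv:

import pandas as pd


def resolve_dataset(dataset, source, version, versions_file="data/versions.csv"):
    # hypothetical resolver: pick the row of data/versions.csv matching the
    # config entry; "latest-supported" is resolved via the tags column
    df = pd.read_csv(versions_file)
    rows = df[(df["dataset"] == dataset) & (df["source"] == source)]
    if version == "latest-supported":
        rows = rows[rows["tags"].fillna("").str.contains("latest-supported")]
    else:
        rows = rows[rows["version"] == version]
    row = rows.iloc[0]
    if not bool(row["supported"]):
        raise ValueError(f"{dataset} {row['version']} is not supported by this model version")
    # the data would then be placed in data/<dataset>/<source>/<version>/
    return row["version"], row["URL"]


version, url = resolve_dataset("GEM_GSPT", "archive", "latest-supported")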

euronion avatar May 19 '25 12:05 euronion

@lkstrp @coroa @SermishaNarayana :tada:

This PR is now RTR. Comments welcome; for open TODOs / discussion points see above.

euronion avatar Sep 26 '25 15:09 euronion

In the previous failed test run (now restarted), the web archive timed out. I was able to access the dataset manually through the browser; I hope this is not a recurring issue when an archived link is not accessed regularly in the web archive.

edit: NVM, there was a small issue with the web archive links.

euronion avatar Sep 29 '25 06:09 euronion

@coroa suggested wrapping all storage(...) calls to Zenodo into ancient(...) to prevent accidental retrieval with changed mtime on Zenodo. Necessary?

Yes, necessary. Consider shipdensity_global.zip as an example: its URL is "https://zenodo.org/records/13757228/files/shipdensity_global.zip", which is still from record #13757228, published September 13, 2024 as part of version v0.4.1 of the databundle.

But if you ask zenodo when the file was last modified:

❯ curl --head "https://zenodo.org/records/13757228/files/shipdensity_global.zip"
HTTP/1.1 200 OK
server: nginx
content-type: application/octet-stream
content-length: 534907254
[...]
last-modified: Wed, 03 Sep 2025 17:48:45 GMT
[...]

And the http storage provider uses that to determine the mtime (code) and re-downloads, even though it still has it in

❯ ls -l .snakemake/storage/http/zenodo.org/records/13757228/files/shipdensity_global.zip
-rw-rw-r-- 1 coroa coroa 534907254 Aug 13 19:15 .snakemake/storage/http/zenodo.org/records/13757228/files/shipdensity_global.zip

(here from Aug 13 when i last let it download this exact same file).

The ancient flag (docs) means that mtime differences can be ignored.

This is special for Zenodo because the data for a single record is not allowed to change after it is published; a new version introduces a new record and thus a new URL.

coroa avatar Oct 01 '25 10:10 coroa

TBH I don't understand what the storage plugin is doing sometimes. I had an idea, but I can't test it, because I cannot get snakemake to trigger on its own on an outdated file:

I ran

> snakemake -c1 retrieve_ship_raster -f

to download the raster, checking the last-modified date on Zenodo (it is a different record than yours):

❮ curl --head https://zenodo.org/records/16894236/files/shipdensity_global.zip
HTTP/1.1 200 OK
server: nginx
content-type: application/octet-stream
content-length: 534907254
content-security-policy: default-src 'self' fonts.googleapis.com *.gstatic.com data: 'unsafe-inline' 'unsafe-eval' blob: zenodo-broker.web.cern.ch zenodo-broker-qa.web.cern.ch maxcdn.bootstrapcdn.com cdnjs.cloudflare.com ajax.googleapis.com webanalytics.web.cern.ch
x-content-type-options: nosniff
x-download-options: noopen
x-permitted-cross-domain-policies: none
x-frame-options: sameorigin
x-xss-protection: 1; mode=block
content-disposition: attachment; filename=shipdensity_global.zip
last-modified: Mon, 18 Aug 2025 12:17:24 GMT
date: Wed, 01 Oct 2025 11:50:50 GMT
link: <https://zenodo.org/records/16894236> ; rel="collection" ; type="text/html" , <https://zenodo.org/api/records/16894236> ; rel="linkset" ; type="application/linkset+json"
x-ratelimit-limit: 133
x-ratelimit-remaining: 131
x-ratelimit-reset: 1759319511
retry-after: 60
permissions-policy: interest-cohort=()
strict-transport-security: max-age=31556926; includeSubDomains
referrer-policy: strict-origin-when-cross-origin
set-cookie: session=c159929825985543_68dd159a.oRdWapR-KXgfWRQFc_4JrErGejA; Expires=Mon, 06 Oct 2025 11:50:50 GMT; Secure; HttpOnly; Path=/; SameSite=Lax
strict-transport-security: max-age=15768000
x-request-id: 0f43d6d3a9a649f00545d6babd0b5443
set-cookie: 5569e5a730cade8ff2b54f1e815f3670=90e4e7f47bd8eac1a5a7440275b16b80; path=/; HttpOnly; Secure; SameSite=None
cache-control: private

now touching the output and the storage file to make their mtime older than the Zenodo record, as I want snakemake to trigger a re-run

❮ touch -d "9 weeks ago" data/ship_raster/archive/v5/shipdensity_global.zip
❮ touch -d "9 weeks ago" .snakemake/storage/http/zenodo.org/records/16894236/files/shipdensity_global.zip

But it is not rerunning the workflow. When I request the same file again, it tells me instead:

❮ snakemake -n -c1 retrieve_ship_raster
[...]
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).

Back to my idea, you can probably tell me if it is a possible workaround: Instead of wrapping everything in ancient(..), can we set in the config

storage:
    provider="http",
    ...
    # Whether the storage provider supports HTTP HEAD requests.
    supports_head=False,

My understanding from the code you shared about the storage plugin is that without the HEAD information it will set the mtime to 0, meaning it should not rerun, right?

euronion avatar Oct 01 '25 12:10 euronion

  1. Let me play with your idea for a sec, i expect that snakemake's .snakemake/metadata storage system interferes with it.
  2. I would be very careful with supports_head, my reading of the code is that when the storage db requests the mtime it actually downloads the full file already, and then after determining the mtime is old downloads it again. But I did not test.

coroa avatar Oct 01 '25 12:10 coroa

  1. Let me play with your idea for a sec, i expect that snakemake's .snakemake/metadata storage system interferes with it.

Grr... snakemake seems to have some new optimisation that i don't understand. it currently cleans up the .snakemake/storage directory after each run even though i have keep_local=True. But okay.

Tests on master:

❯ snakemake -c1 retrieve_ship_raster -f
[...]
Building DAG of jobs...
Retrieving .snakemake/storage/http/zenodo.org/records/13757228/files/shipdensity_global.zip from storage.
Retrieving from storage: https://zenodo.org/records/13757228/files/shipdensity_global.zip
[...]

As said, while i do have .snakemake/storage during the download, it is removed after the snakemake run finishes, but that is not much of an issue (since the file is available as data/shipdensity_global.zip (on master)).

❯ ls -l data/shipdensity_global.zip 
-rw-rw-r-- 1 coroa coroa 534907254 Oct  1 15:20 data/shipdensity_global.zip

If i do re-run, it is happy:

❯ snakemake -c1 retrieve_ship_raster -n
[...]
Nothing to be done (all requested files are present and up to date).

If i set the timestamp to before the August last-modified time, it wants to redownload:

❯ touch -d "9 weeks ago" data/shipdensity_global.zip
❯ snakemake -c1 retrieve_ship_raster -n             
[...]
[Wed Oct  1 15:26:08 2025]
rule retrieve_ship_raster:
    input: https://zenodo.org/records/13757228/files/shipdensity_global.zip (retrieve from storage)
    output: data/shipdensity_global.zip
    log: logs/retrieve_ship_raster.log
    jobid: 0
    reason: Updated input files: https://zenodo.org/records/13757228/files/shipdensity_global.zip (retrieve from storage)
    resources: tmpdir=<TBD>, mem_mb=5000, mem_mib=4769
[...]

If i wrap with ancient(storage(..., keep_local=True)) in retrieve.smk:

❯ snakemake -c1 retrieve_ship_raster -n
[...]
Nothing to be done (all requested files are present and up to date).

With:

storage:
    provider="http",
    keep_local=True,
    # Whether the storage provider supports HTTP HEAD requests.
    supports_head=False,

it again tries to download, although the time is not long enough to suggest it did a full download before deciding, so i am unsure what the internals do:

❯ snakemake -c1 retrieve_ship_raster -n
[...]
[Wed Oct  1 15:32:08 2025]
rule retrieve_ship_raster:
    input: https://zenodo.org/records/13757228/files/shipdensity_global.zip (retrieve from storage)
    output: data/shipdensity_global.zip
    log: logs/retrieve_ship_raster.log
    jobid: 0
    reason: Updated input files: https://zenodo.org/records/13757228/files/shipdensity_global.zip (retrieve from storage)
    resources: tmpdir=<TBD>, mem_mb=5000, mem_mib=4769

coroa avatar Oct 01 '25 13:10 coroa

If you don't like the look of:

rule ...:
    input: ancient(storage("http://zenodo.org/records/.../files/shipdensity_global.zip"))

then how about:

def zenodo(url):
     return ancient(storage(url, keep_local=True))

rule a:
     input: zenodo("http://zenodo.org/records/.../files/filea.ext")

rule b:
    input: zenodo("http://zenodo.org/records/.../files/fileb.ext")

coroa avatar Oct 01 '25 13:10 coroa

If you don't like the look of:

rule ...:
    input: ancient(storage("http://zenodo.org/records/.../files/shipdensity_global.zip"))

then how about:

def zenodo(url):
     return ancient(storage(url, keep_local=True))

rule a:
     input: zenodo("http://zenodo.org/records/.../files/filea.ext")

rule b:
    input: zenodo("http://zenodo.org/records/.../files/fileb.ext")

I like this, thanks for the suggestion. I'd make the following modification, such that we use only the "auto" storage provider everywhere, as the URL could be from Zenodo (archive case) or a different location (primary case):

def http_storage(url, **kwargs):
    import urllib.parse

    # Zenodo sometimes returns a "last-modified" date in the header that makes it look like the underlying
    # file has been modified recently, which would trigger a re-download, even though the file itself
    # has not changed (Zenodo URLs for files are immutable; a new version gets a new URL).
    # Use the "ancient" wrapper to ignore the last-modified date for Zenodo URLs.
    if "zenodo.org" in urllib.parse.urlparse(url).netloc:
        return ancient(storage(url, **kwargs))
    else:
        return storage(url, **kwargs)

If you're happy with this, we can ask @SermishaNarayana to implement it like this.
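
For illustration, a retrieve rule could then look roughly like this (a sketch, not the actual rule from this branch; the URL and version would come from data/versions.csv):

rule retrieve_ship_raster:
    input:
        http_storage(
            "https://zenodo.org/records/16894236/files/shipdensity_global.zip",
            keep_local=True,
        ),
    output:
        "data/ship_raster/archive/v5/shipdensity_global.zip",
    run:
        import shutil
        shutil.copy(input[0], output[0])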

euronion avatar Oct 06 '25 14:10 euronion

def http_storage(url, **kwargs):
    import urllib.parse

    # Zenodo sometimes returns a "last-modified" date in the header that makes it look like the underlying
    # file has been modified recently, which would trigger a re-download, even though the file itself
    # has not changed (Zenodo URLs for files are immutable; a new version gets a new URL).
    # Use the "ancient" wrapper to ignore the last-modified date for Zenodo URLs.
    if "zenodo.org" in urllib.parse.urlparse(url).netloc:
        return ancient(storage(url, **kwargs))
    else:
        return storage(url, **kwargs)

If you're happy with this, we can ask @SermishaNarayana to implement it like this.

Sure, makes sense. @SermishaNarayana If you can think of a shorter name to carry the same meaning, i'd go with it; but otherwise let's go as is.

coroa avatar Oct 06 '25 19:10 coroa

CI fails occasionally because of timeouts from the web archive. It is not clear why, but the timeouts are not persistent. Can probably be fixed.

  • [ ] Fix timeouts from web archive

euronion avatar Oct 09 '25 10:10 euronion

I'm not sure what the snakemake problem is, so I'll also summon @coroa into this issue:

The failing macOS CI run hits a problem that we regularly see in the CI for retrieval from Zenodo using storage(...). It is not OS-specific:

Failed to check existence of https://zenodo.org/records/16965042/files/kfz.csv
SSLError: HTTPSConnectionPool(host='zenodo.org', port=443): Max retries exceeded with url: /records/16965042/files/kfz.csv (Caused by SSLError(SSLCertVerificationError(1, "[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'zenodo.org'. (_ssl.c:1010)")))
make: *** [test] Error 1

While this sounds like a problem with Zenodo, the data is actually retrieved earlier, and apparently successfully, here.

The certificate also seems to match, at least before the workflow is executed; I added this check for debugging this specific problem.

The problem is transient. It sometimes appears and is sometimes gone.

euronion avatar Oct 10 '25 13:10 euronion