data: Intermediate layer for versioning of datasets
This PR suggests an implementation approach for an intermediate data layer that allows retrieving a specific version of a dataset, rather than always the latest.
The approach is currently implemented for the OSM dataset, as this one was already tracking versions and working very similarly. I'll expand it to two more datasets in the upcoming days.
Motivation
Upstream data dependencies are not always versioned or fixed, meaning they may change unexpectedly without a way to revert to a different version. This causes reproducibility issues.
Approach
This PR solves the problem in the following way:
- We create archived versions of all external datasets (if we are allowed -> question of licensing) on e.g. Zenodo
- The URL for retrieving each combination of (dataset x version) is stored in
data/versions.csv. This allows us to switch to a different data plattform or provider if necessary; or use a versioned URL directly from a data provider if available. data/versions.csvalso records the license and the description for the dataset. I plan on utilising this information to automatically create new versions of datasets and distribute the license text + metadata information based, as well as utilise this file for the documentation here- I imagine that all externally retrieved data will get two rules: (1) a rule for retrieving the
upstreamversion, which I calledsource: "build"in theconfig.default.yaml, and (2) a rule to retrieve an archived version of the data. Both rules yield the same files that are then consumed by the model.
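As a rough illustration of this two-rule pattern: the rule names, config handling, Zenodo record ID and version string below are placeholders I made up, not the ones in this PR; both rules produce the same file for downstream consumers, and the config selects which one is active.

```python
# Sketch only: rule names, config handling, URL and version are illustrative placeholders.
version = "0.6"      # in practice resolved from the config / data/versions.csv
source = "archive"   # in practice `source: "build"` or `source: "archive"` from the config

if source == "archive":

    rule retrieve_osm_prebuilt:
        input:
            storage(f"https://zenodo.org/records/<record-id>/files/osm-prebuilt-{version}.zip"),
        output:
            f"data/osm/{version}/osm-prebuilt.zip",
        shell:
            "cp {input} {output}"

else:  # source == "build"

    rule build_osm_prebuilt:
        # rebuilds the same output file from the upstream source
        output:
            f"data/osm/{version}/osm-prebuilt.zip",
        script:
            "../scripts/build_osm_network.py"
```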
TODO
- [ ] Implement for two more datasets
- [ ] Helper and instructions for creating a new version of a dataset from `upstream` to `archive` (small script/CLI tool)
- [ ] Unbundle the data bundle on Zenodo
- [ ] Move datasets from `data/` to Zenodo; keep files in `data` that are manual specifications/inputs, e.g. as `data/manual`
- [ ] Update documentation to utilise `data/versions.csv`
Comments are already welcome!
Checklist
- [ ] I tested my contribution locally and it works as intended.
- [ ] Code and workflow changes are sufficiently documented.
- [ ] Changed dependencies are added to
envs/environment.yaml. - [ ] Changes in configuration options are added in
config/config.default.yaml. - [ ] Changes in configuration options are documented in
doc/configtables/*.csv. - [ ] Sources of newly added data are documented in
doc/data_sources.rst. - [ ] A release note
doc/release_notes.rstis added.
Now implemented for the Worldbank Urban Population dataset.
This dataset uses the same method for retrieving from the World Bank as for retrieving from Zenodo (sandbox link for now); the structure also lends itself to providing the upstream information in the `data/versions.csv` file.
@euronion Interesting idea, I think some of that aligns neatly with the data catalogue. I'll follow you developing that and maybe propose a slight variation today or tomorrow. The data/versions.csv you are mentioning is not part of the branch yet. Can you please add it!
Thanks @coroa , you can now have a closer look! Let's also have another chat on how it overlaps with your activities.
I've updated the code, now data/versions.csv is included.
It also includes a third dataset - the GEM GSPT (Global Steel Plant Tracker by GEM), which I have chosen as a third example, because:
- It shows how the structure allows us to accommodate external data versioning activities, i.e. GEM provides dedicated links to different versions of a dataset, so we don't necessarily need to create a Zenodo mirror (for other reasons we still should for this dataset)
- It shows how we can track unsupported versions of a dataset in `data/versions.csv`, i.e. I have added the newest version of the GSPT as "upstream" and "not supported", because in the new version the file format of the data changed and is no longer compatible with the current workflow. This can also be used to mark datasets that are no longer compatible as "deprecated"
Finally,
- I've opted to rename the output file of the GSPT, such that the version is only encoded in the folder, not in the file name (for easier switching to new versions), here: https://github.com/PyPSA/pypsa-eur/blob/27534f256c2c206863e0dd82cbcc99512fdaa660/rules/retrieve.smk#L429
- To showcase how we can avoid clutter/bugs with the ever more complicated dependencies between rules, I've used the `rules.<rule_name>.output["<output_name>"]` reference of the GEM GSPT instead of specifying the file name explicitly. We could use this approach instead of manually specifying the paths + versions each time (see the sketch below): https://github.com/PyPSA/pypsa-eur/blob/27534f256c2c206863e0dd82cbcc99512fdaa660/rules/build_sector.smk#L696
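For illustration, the pattern looks roughly like this; rule names, paths and the consuming script below are made up, the linked lines above show the real usage:

```python
# Sketch only: rule names and paths are illustrative placeholders.
rule retrieve_gem_gspt:
    input:
        storage("https://zenodo.org/records/<record-id>/files/Global-Steel-Plant-Tracker.xlsx"),
    output:
        gspt="data/gem_gspt/April-2024-V1/Global-Steel-Plant-Tracker.xlsx",
    shell:
        "cp {input} {output.gspt}"

rule build_steel_plants:
    input:
        # no hard-coded path or version; follows whatever the retrieve rule produces
        gspt=rules.retrieve_gem_gspt.output["gspt"],
    output:
        "resources/steel_plants.csv",
    script:
        "../scripts/build_steel_plants.py"
```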
Recording an idea:
With the structure I propose above, we have a dedicated folder for each data input and version. This would be a good place to store a copy of the LICENSE for that particular dataset as well as a metadata.json.
E.g.
data/worldbank_urban_population
├── 2025-05-07
│ ├── API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2.csv
│ ├── API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2.zip
│ ├── LICENSE
│ ├── Metadata_Country_API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_86733.csv
│ ├── Metadata_Indicator_API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_86733.csv
│ └── metadata.json
│ ...
For any data that we store on Zenodo, we can add them to the Zenodo repo. For datasets that we don't put on Zenodo or that are from upstream, we'd require a different solution for getting/storing the metadata and LICENSE.
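For example, a metadata.json in such a version folder could look roughly like this; the field names and values are only a suggestion, nothing here is decided:

```json
{
  "dataset": "worldbank_urban_population",
  "version": "2025-05-07",
  "source": "archive",
  "url": "https://zenodo.org/records/<record-id>/files/API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2.zip",
  "license": "see LICENSE file in this folder",
  "retrieved": "2025-05-07",
  "description": "World Bank urban population share (indicator SP.URB.TOTL.IN.ZS)"
}
```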
Noting @lkstrp that this structure should also allow us to easily exchange "Zenodo" for any other type of data repo that allows for direct access to files, including S3 buckets.
Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data which is retrieved from the data stored within the repo/upstream, so that we don't mix that up within `./data`? Probably moving all repo data just to `./data/repo` or `./data/git` or similar.
> Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data which is retrieved from the data stored within the repo/upstream, so that we don't mix that up within `./data`? Probably moving all repo data just to `./data/repo` or `./data/git` or similar.
We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.
> Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data which is retrieved from the data stored within the repo/upstream, so that we don't mix that up within `./data`? Probably moving all repo data just to `./data/repo` or `./data/git` or similar.
>
> We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.
Seconding the latter idea. Rather than splitting it, removing it from the repo.
> We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.
>
> Seconding the latter idea. Rather than splitting it, removing it from the repo.
Oh yes, if we aim big, let's remove them. But I assume they have a reason to be there in the first place, and we cannot just move them all to Zenodo since the update frequency needs to be much lower there (?). The repo data is updated quite often and moving that to Zenodo will just be a huge pain and not the final solution. Either we come up with some S3 storage in here already, keep it in here for now or use some temp solution (repo/ nextcloud).
This was not originally about update frequency; it was more about convenience and size constraints. Here are the update counts for all files in the data directory over the last year:
data/agg_p_nom_minmax.csv, 1
data/ammonia_plants.csv, 2
data/attributed_ports.json, 0
data/biomass_transport_costs_supplychain1.csv, 1
data/biomass_transport_costs_supplychain2.csv, 1
data/cement-plants-noneu.csv, 1
data/ch_cantons.csv, 0
data/ch_industrial_production_per_subsector.csv, 1
data/custom_extra_functionality.py, 1
data/custom_powerplants.csv, 0
data/district_heat_share.csv, 2
data/egs_costs.json, 1
data/eia_hydro_annual_capacity.csv, 1
data/eia_hydro_annual_generation.csv, 1
data/entsoegridkit/README.md, 0
data/entsoegridkit/buses.csv, 0
data/entsoegridkit/converters.csv, 0
data/entsoegridkit/generators.csv, 0
data/entsoegridkit/lines.csv, 0
data/entsoegridkit/links.csv, 1
data/entsoegridkit/transformers.csv, 0
data/existing_infrastructure/existing_heating_raw.csv, 1
data/gr-e-11.03.02.01.01-cc.csv, 0
data/heat_load_profile_BDEW.csv, 0
data/hydro_capacities.csv, 0
data/links_p_nom.csv, 1
data/nuclear_p_max_pu.csv, 1
data/parameter_corrections.yaml, 1
data/refineries-noneu.csv, 1
data/retro/comparative_level_investment.csv, 0
data/retro/data_building_stock.csv, 0
data/retro/electricity_taxes_eu.csv, 0
data/retro/floor_area_missing.csv, 0
data/retro/retro_cost_germany.csv, 0
data/retro/u_values_poland.csv, 0
data/retro/window_assumptions.csv, 0
data/switzerland-new_format-all_years.csv, 0
data/transmission_projects/manual/new_links.csv, 2
data/transmission_projects/nep/new_lines.csv, 2
data/transmission_projects/nep/new_links.csv, 3
data/transmission_projects/template/new_lines.csv, 1
data/transmission_projects/template/new_links.csv, 1
data/transmission_projects/template/upgraded_lines.csv, 1
data/transmission_projects/template/upgraded_links.csv, 1
data/transmission_projects/tyndp2020/new_lines.csv, 1
data/transmission_projects/tyndp2020/new_links.csv, 2
data/transmission_projects/tyndp2020/upgraded_lines.csv, 1
data/transmission_projects/tyndp2020/upgraded_links.csv, 1
data/unit_commitment.csv, 0
by
for i in $(git ls-files data); do echo $i, $(git log --oneline --since="1 year ago" ${i} | wc -l); done
And this is too frequent for Zenodo.
> And this is too frequent for Zenodo.
The data bundle alone received about 10 versions in the same time span. Are you talking about the cumulative amount of updates if you bundle them up together?
> And this is too frequent for Zenodo.
>
> The data bundle alone received about 10 versions in the same time span. Are you talking about the cumulative amount of updates if you bundle them up together?
5 commits for data/transmission_projects and 15 for the rest (out of which 2 should live in technology data, I guess).
Some of the 15 are deletions, some are within the span of a week. Still too many to handle manually, I guess.
I don't think we should be handling any of it manually anyway. I was thinking of writing a small CLI script that helps to create new versions on Zenodo.
Not only to make it easier, but also to avoid mistakes slipping in.
Files like parameter_corrections, or NEP plans, deserve to be version-controlled since they are hand-written rather than imported.
So tracking them in a git repository would still be good practice for them. Maybe does not have to be directly in this repository, but also does not hurt.
Maybe a sub-directory like: data/manual, or a pypsa-eur-data-manual repository, but then this also needs to be maintained and version synced.
Small CLI script sounds good and the numbers also don't sound too high, but I am just against using Zenodo for this. In the long term the data bundle should vanish / not just be a storage dump. So we shouldn't bloat that up now.
We need to reupload the whole directory for any new version on Zenodo. Zenodo cannot just update a single file of a bundle. So, if only one of 20 datasets needs an update, we have to reupload them all. This alone is an unpleasant misuse already. But all 20 of them get a new version tag as well, even if for 19 there is no difference between versions. So the whole purpose of versioning of datasets is also gone.
As discussed above, the end goal of a data layer needs to provide a version tag per dataset, with two sources: 'archive' and 'primary', while primary may just support latest/nightly. Zenodo is just not designed for this.
> Small CLI script sounds good

- added as open TODO

> and the numbers also don't sound too high, but I am just against using Zenodo for this. In the long term the data bundle should vanish / not just be a storage dump. So we shouldn't bloat that up now.
Agreed. I wasn't thinking of moving the data from the repo into the data bundle. I was thinking about moving the data from the repo into dedicated Zenodo datasets. One Zenodo URL per standalone dataset. Not what we are doing now with the databundle.
> We need to reupload the whole directory for any new version on Zenodo.

Yes, and I don't want to repeat that either if we just want to update parts of the data.
> As discussed above, the end goal of a data layer needs to provide a version tag per dataset, with two sources: 'archive' and 'primary', while primary may just support latest/nightly. Zenodo is just not designed for this.
Keeping the tags aside, Zenodo is not built for having a single record contain multiple datasets. What I would do is create a dedicated record per dataset. In that case Zenodo serves our purpose nicely. And since we use the storage(...) provider from Snakemake, we can always just provide a different URL if we want to switch to a storage bucket or another archive - they only need to provide version-specific direct URLs for accessing the datasets.
Ok. If we create a dedicated record for each dataset on Zenodo, I would still argue that this is an unnecessary overhead, but if you want to go for it, I'll give up my resistance. As you say, we can easily switch then 👍
This is lovely @euronion !
I have a couple of thoughts on the general schema: [...]
Thanks for the feedback @lkstrp - what I understand is that you only have concerns about the schema, but no comments or concerns about the implementation. Is that correct?
Naming
About your schema concerns: I wasn't very happy with my suggestions either, so I'm happy to change them.
Indexing in data/sources.csv
I'm fine with indexing through dataset (or dataset_name, source, version); that's the status quo anyway, just with renamed variable names.
Sources
On the source I indeed intentionally mixed your (1) and (3) into "build", given that I don't know of any data source that provides both at the same time. But I see that it is clearer to separate them and accept that most datasets will have (2) and either (1) or (3), but not all of (1), (2) and (3).
Versions
The only benefit I see that we gain from having consistent version keys across sources is being able to get rid of Recency.
Especially since we don't want to increase the version numbers simultaneously, one would end up with some datasets at v1.0.0, some at v1.0.1 or v4.0.0.
- The downside, I believe, is that it requires more effort to compare our data with the primary source's version names, e.g. if we rename GEM's `April-2024-V1` to `v1.0.0` we are obfuscating their version number.
I'd rather keep the primary source's version names.
Recency
I introduced this column to help me find the "latest" version of a dataset, since the version names are not guaranteed to sort correctly or follow semantic versioning, due to the different naming schemes primary data providers may use.
Then I realised that it has additional value: it can mark whether the model is still compatible with a dataset, flag e.g. "old" or "deprecated/incompatible" versions, and indicate what the current intended/supported version is.
I.e. you can keep "latest" in the config.yaml and get an auto-update of a dataset if you upgrade between PyPSA-Eur versions, without having to check whether a new version of the dataset is available and whether you need to update your configfile.
I think it would be nice to keep it, for look-up purposes only and not for indexing the file: instead of specifying the version in the config file, one provides the recency.
Happy to rename, just not to "tag" - that does not seem descriptive enough for me.
What do you think?
To summarize ...
I'd go with something like this:
data/versions.csv:
| dataset | source | version | recency |
|---|---|---|---|
| GEM_GSPT | primary | Febuly-2999-V1 | unstable / nightly / untested |
| GEM_GSPT | primary | April-2024-V1 | latest |
| GEM_GSPT | primary | January-1970-V1 | deprecated |
| GEM_GSPT | primary | January-2000-V1 | outdated |
| GEM_GSPT | archive | April-2024-V1 | latest |
| GEM_GSPT | archive | January-1970-V1 | deprecated |
| GEM_GSPT | archive | January-2000-V1 | outdated |
| ... | ... | ... | ... |
| OSM | build | build | unstable / nightly / untested |
| OSM | archive | 0.7 | unstable / nightly / untested |
| OSM | archive | 0.6 | latest |
| OSM | archive | 0.1 | deprecated |
| ... | ... | ... | ... |
| WDPA | primary | primary | unstable / nightly / untested / we don't have anything better or an archived version |
- all datasets are downloaded to `data/<dataset>/<version>/`
- `config.yaml` will have:
datasets:
<dataset>:
source: "primary" | "archive" | "build"
version: "<a version from versions.csv>" | "" # either version or recency need to be specified
recency: "" | "latest" | "nightly" # either version or recency need to be specified
Update after some discussions:
For data/versions.csv we will go with six columns (an illustrative excerpt follows the list):
- `dataset`: name of the dataset
- `source`: one of `primary | build | archive`, determining whether the data is retrieved from the original data provider (primary), built based on the original data source, e.g. OSM (build), or retrieved as an archived version from our mirror on e.g. Zenodo (archive)
- `version`: name of the version following the versioning schema of the original data provider. If the original data provider does not have a versioning schema, we'll go with a pragmatic version name, e.g. the date `YYYY-MM-DD` the data was retrieved and the archived version was created.
- `tags`: a list of different tags that we support. For now, the only one is `latest-supported`, which refers to the latest version of a dataset that is supported by the model. `latest-supported` needs to be bumped when creating a new version of a dataset and putting it into the file. Future `tags` options envisioned are e.g. `nightly` or `latest`.
- `supported`: a flag, either `TRUE` or `FALSE`, indicating whether the current model version supports this dataset. We'll not actively monitor or test for compatibility; the intention is to indicate, when a new version of a dataset is added, whether the previous version is just outdated or whether the data schema/contents changed and it is therefore no longer compatible with and supported by the model.
- `URL`: URL pointing to the resource for download.
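An illustrative excerpt; all values, record IDs and URLs below are placeholders I made up:

```
dataset,source,version,tags,supported,URL
GEM_GSPT,archive,April-2024-V1,latest-supported,TRUE,https://zenodo.org/records/<record-id>/files/GSPT-April-2024-V1.xlsx
GEM_GSPT,primary,April-2024-V1,,TRUE,https://globalenergymonitor.org/...
osm-prebuilt,archive,0.6,latest-supported,TRUE,https://zenodo.org/records/<record-id>/files/osm-prebuilt-0.6.zip
osm-prebuilt,build,unknown,,TRUE,
worldbank_urban_population,primary,unknown,,TRUE,https://api.worldbank.org/...
```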
Further:
- Downloaded data will be located in dedicated subfolders
data/<dataset>/<source>/<version>/, allowing for clear separation of any dataset. - If the
primaryorbuildsource allows for downloading continuously updated data without a versioning schema, e.g. OSM, then theversionto use by convention is 'unknown'` - In the config file, we specify the data using
sourceandversionfor eachdataset.versionis a valid version from the.csv, with the special version name oflatest-supportedthat get's resolved to the version of the dataset with this particular tag. This version should be the default for most users, as this way they always get the newest data that is compatible with the model after upgrades, without loosing previous datasets should they desire to switch back or compare.
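As a sketch of how the `latest-supported` resolution could work — this helper, its name and the pandas-based lookup are my own illustration, not code from the PR:

```python
import pandas as pd


def resolve_version(versions_csv, dataset, source, version):
    """Resolve a (dataset, source, version) config entry to one row of data/versions.csv.

    `version` may be a concrete version name or the special value
    "latest-supported", which is looked up via the `tags` column.
    """
    df = pd.read_csv(versions_csv)
    df = df[(df["dataset"] == dataset) & (df["source"] == source)]
    if version == "latest-supported":
        df = df[df["tags"].fillna("").str.contains("latest-supported")]
    else:
        df = df[df["version"] == version]
    if df.empty:
        raise ValueError(f"No entry for {dataset}/{source}/{version} in {versions_csv}")
    return df.iloc[0]  # row containing version, supported, URL, ...


# illustrative usage:
# row = resolve_version("data/versions.csv", "GEM_GSPT", "archive", "latest-supported")
# url, version = row["URL"], row["version"]
```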
@lkstrp @coroa @SermishaNarayana :tada:
This PR is now RTR. Comments welcome; for open TODOs / discussion points see above.
In the previous failed test run (now restarted), the web archive timed out. I was able to access the dataset manually through the browser; I hope this is not a recurring issue when an archived link is not accessed repeatedly in the web archive.
edit: NVM, there was a small issue with the web archive links.
@coroa suggested wrapping all storage(...) calls to Zenodo into ancient(...) to prevent accidental retrieval with changed mtime on Zenodo. Necessary?
Yes, necessary. Consider as an example: shipdensity_global.zip, the url for it is: "https://zenodo.org/records/13757228/files/shipdensity_global.zip", which is still from Record#13757228 published September 13, 2024 as part of Version v0.4.1 of the databundle.
But if you ask zenodo when the file was last modified:
❯ curl --head "https://zenodo.org/records/13757228/files/shipdensity_global.zip"
HTTP/1.1 200 OK
server: nginx
content-type: application/octet-stream
content-length: 534907254
[...]
last-modified: Wed, 03 Sep 2025 17:48:45 GMT
[...]
And the http storage provider uses that to determine the mtime (code) and re-downloads, even though it still has it in
❯ ls -l .snakemake/storage/http/zenodo.org/records/13757228/files/shipdensity_global.zip
-rw-rw-r-- 1 coroa coroa 534907254 Aug 13 19:15 .snakemake/storage/http/zenodo.org/records/13757228/files/shipdensity_global.zip
(here from Aug 13 when I last let it download this exact same file).
The ancient flag (docs) means that mtime differences can be ignored.
This is special for zenodo because the data for a single record is not allowed to change after it is published, a new version introduces a new record and thus a new url.
TBH I don't understand what the storage plugin is doing sometimes. I had an idea, but I can't test it, because I cannot get snakemake to trigger on its own on an outdated file:
I ran
> snakemake -c1 retrieve_ship_raster -f
to download the raster, checking the last-modified date on Zenodo (it is a different record than yours):
❮ curl --head https://zenodo.org/records/16894236/files/shipdensity_global.zip
HTTP/1.1 200 OK
server: nginx
content-type: application/octet-stream
content-length: 534907254
content-security-policy: default-src 'self' fonts.googleapis.com *.gstatic.com data: 'unsafe-inline' 'unsafe-eval' blob: zenodo-broker.web.cern.ch zenodo-broker-qa.web.cern.ch maxcdn.bootstrapcdn.com cdnjs.cloudflare.com ajax.googleapis.com webanalytics.web.cern.ch
x-content-type-options: nosniff
x-download-options: noopen
x-permitted-cross-domain-policies: none
x-frame-options: sameorigin
x-xss-protection: 1; mode=block
content-disposition: attachment; filename=shipdensity_global.zip
last-modified: Mon, 18 Aug 2025 12:17:24 GMT
date: Wed, 01 Oct 2025 11:50:50 GMT
link: <https://zenodo.org/records/16894236> ; rel="collection" ; type="text/html" , <https://zenodo.org/api/records/16894236> ; rel="linkset" ; type="application/linkset+json"
x-ratelimit-limit: 133
x-ratelimit-remaining: 131
x-ratelimit-reset: 1759319511
retry-after: 60
permissions-policy: interest-cohort=()
strict-transport-security: max-age=31556926; includeSubDomains
referrer-policy: strict-origin-when-cross-origin
set-cookie: session=c159929825985543_68dd159a.oRdWapR-KXgfWRQFc_4JrErGejA; Expires=Mon, 06 Oct 2025 11:50:50 GMT; Secure; HttpOnly; Path=/; SameSite=Lax
strict-transport-security: max-age=15768000
x-request-id: 0f43d6d3a9a649f00545d6babd0b5443
set-cookie: 5569e5a730cade8ff2b54f1e815f3670=90e4e7f47bd8eac1a5a7440275b16b80; path=/; HttpOnly; Secure; SameSite=None
cache-control: private
Now touching the output and the storage file to make their mtime older than the Zenodo record, as I want snakemake to trigger:
❮ touch -d "9 weeks ago" data/ship_raster/archive/v5/shipdensity_global.zip
❮ touch -d "9 weeks ago" .snakemake/storage/http/zenodo.org/records/16894236/files/shipdensity_global.zip
But it is not rerunning the workflow. When I request the same file again, it tells me instead:
❮ snakemake -n -c1 retrieve_ship_raster
[...]
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
Back to my idea, you can probably tell me if it is a possible workaround:
Instead of wrapping everything in ancient(..), can we set in the config
storage:
provider="http",
...
# Whether the storage provider supports HTTP HEAD requests.
supports_head=False,
My understanding from the code you shared about the storage plugin is that without the HEAD information it will set the mtime to 0. Meaning it should not rerun, right?
- Let me play with your idea for a sec; I expect that snakemake's `.snakemake/metadata` storage system interferes with it.
- I would be very careful with `supports_head`: my reading of the code is that when the storage db requests the mtime it actually downloads the full file already, and then, after determining the mtime is old, downloads it again. But I did not test this.
> Let me play with your idea for a sec; I expect that snakemake's `.snakemake/metadata` storage system interferes with it.
Grr... snakemake seems to have some new optimisation that I don't understand. It currently cleans up the .snakemake/storage directory after each run even though I have keep_local=True. But okay.
Tests on master:
❯ snakemake -c1 retrieve_ship_raster -f
[...]
Building DAG of jobs...
Retrieving .snakemake/storage/http/zenodo.org/records/13757228/files/shipdensity_global.zip from storage.
Retrieving from storage: https://zenodo.org/records/13757228/files/shipdensity_global.zip
[...]
As said, while I do have .snakemake/storage during the download, it is removed after the snakemake run finishes, but that is not much of an issue (since the file is available as data/shipdensity_global.zip on master).
❯ ls -l data/shipdensity_global.zip
-rw-rw-r-- 1 coroa coroa 534907254 Oct 1 15:20 data/shipdensity_global.zip
If I re-run, it is happy:
❯ snakemake -c1 retrieve_ship_raster -n
[...]
Nothing to be done (all requested files are present and up to date).
If I set the timestamp to before the August last-modified time, it wants to redownload:
❯ touch -d "9 weeks ago" data/shipdensity_global.zip
❯ snakemake -c1 retrieve_ship_raster -n
[...]
[Wed Oct 1 15:26:08 2025]
rule retrieve_ship_raster:
input: https://zenodo.org/records/13757228/files/shipdensity_global.zip (retrieve from storage)
output: data/shipdensity_global.zip
log: logs/retrieve_ship_raster.log
jobid: 0
reason: Updated input files: https://zenodo.org/records/13757228/files/shipdensity_global.zip (retrieve from storage)
resources: tmpdir=<TBD>, mem_mb=5000, mem_mib=4769
[...]
If I wrap with ancient(storage(..., keep_local=True)) in retrieve.smk:
❯ snakemake -c1 retrieve_ship_raster -n
[...]
Nothing to be done (all requested files are present and up to date).
With:
storage:
provider="http",
keep_local=True,
# Whether the storage provider supports HTTP HEAD requests.
supports_head=False,
it again tries to download, although the time is not long enough to suggest it did a full download before deciding, so I am unsure what the internals do:
❯ snakemake -c1 retrieve_ship_raster -n
[...]
[Wed Oct 1 15:32:08 2025]
rule retrieve_ship_raster:
input: https://zenodo.org/records/13757228/files/shipdensity_global.zip (retrieve from storage)
output: data/shipdensity_global.zip
log: logs/retrieve_ship_raster.log
jobid: 0
reason: Updated input files: https://zenodo.org/records/13757228/files/shipdensity_global.zip (retrieve from storage)
resources: tmpdir=<TBD>, mem_mb=5000, mem_mib=4769
If you don't like the look of:
rule ...:
    input: ancient(storage("http://zenodo.org/records/.../files/shipdensity_global.zip"))
then how about:
def zenodo(url):
    return ancient(storage(url, keep_local=True))

rule a:
    input: zenodo("http://zenodo.org/records/.../files/filea.ext")

rule b:
    input: zenodo("http://zenodo.org/records/.../files/fileb.ext")
> If you don't like the look of:
>
> rule ...:
>     input: ancient(storage("http://zenodo.org/records/.../files/shipdensity_global.zip"))
>
> then how about:
>
> def zenodo(url):
>     return ancient(storage(url, keep_local=True))
>
> rule a:
>     input: zenodo("http://zenodo.org/records/.../files/filea.ext")
>
> rule b:
>     input: zenodo("http://zenodo.org/records/.../files/fileb.ext")
I like this, thanks for the suggestion. I'd make the following modification, such that we use only the "auto" storage provider everywhere, as the URL could be from Zenodo (archive case) or a different location (primary case):
def http_storage(url, **kwargs):
    import urllib.parse

    # Zenodo sometimes returns a "last-modified" date in the header that makes it seem like the underlying
    # file has been modified recently, which would trigger a re-download, even though the file itself
    # has not changed (Zenodo URLs for files are immutable; a new version gets a new URL).
    # Use the "ancient" wrapper to ignore the last-modified date for Zenodo URLs.
    if "zenodo.org" in urllib.parse.urlparse(url).netloc:
        return ancient(storage(url, **kwargs))
    else:
        return storage(url, **kwargs)
If you're happy with this, we can ask @SermishaNarayana to implement it like this.
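For illustration, a retrieve rule could then use the wrapper like this — the concrete rule body is just a sketch, only the URL and the output path are taken from the examples in this thread — independent of whether the URL points to Zenodo or a primary provider:

```python
rule retrieve_ship_raster:
    input:
        http_storage(
            "https://zenodo.org/records/16894236/files/shipdensity_global.zip",
            keep_local=True,
        ),
    output:
        "data/ship_raster/archive/v5/shipdensity_global.zip",
    shell:
        "cp {input} {output}"
```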
> def http_storage(url, **kwargs): [...]
>
> If you're happy with this, we can ask @SermishaNarayana to implement it like this.
Sure, makes sense. @SermishaNarayana If you can think of a shorter name that carries the same meaning, I'd go with it; but otherwise let's go as is.
CI fails occasionally because of timeouts from the web archive. It's not clear why, but the timeouts are not persistent. Can probably be fixed.
- [ ] Fix timeouts from web archive
I'm not sure what the snakemake problem is, so I'll also summon @coroa into this issue:
The failing CI for macOS encounters a problem that we regularly see in the CI for retrieval using storage(...) from Zenodo. It is not OS-specific:
Failed to check existence of https://zenodo.org/records/16965042/files/kfz.csv
SSLError: HTTPSConnectionPool(host='zenodo.org', port=443): Max retries exceeded with url: /records/16965042/files/kfz.csv (Caused by SSLError(SSLCertVerificationError(1, "[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'zenodo.org'. (_ssl.c:1010)")))
make: *** [test] Error 1
While this sounds like a problem with Zenodo, the data is actually retrieved earlier and apparently successfully here.
The certificate also seems to match, at least before the workflow is executed; I added this check for debugging this specific problem.
The problem is transient. It sometimes appears and is sometimes gone.