Suggestion of an "/archives" endpoint
At the 2021 workshop, we discussed including an OPTIONAL `archives` entry type and corresponding endpoint in the specification. Below is an incomplete summary of the ideas that were discussed (please feel free to add/edit).
Other promoters: @sauliusg @jacksund
The idea is that this endpoint would serve static snapshots of an entire OPTIMADE implementation (as in, all endpoints), potentially also over subsets of the data (e.g., a particular set of materials).
These snapshots MUST be equivalent (in terms of format) to what would be received by crawling the OPTIMADE API, and could be represented as a hierarchical filesystem, e.g.
```
$ tree dump
dump
└── optimade.example.org
    └── v1
        ├── calculations.json
        ├── info
        │   ├── archives.json
        │   ├── calculations.json
        │   ├── links.json
        │   ├── references.json
        │   └── structures.json
        ├── links.json
        ├── references.json
        └── structures.json

3 directories, 9 files
```
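Since the archive is format-equivalent to crawling the live API, the same client code should parse both. A minimal sketch, assuming hypothetical host names and the dump layout shown above:

```python
import json
from urllib.request import urlopen

# Hypothetical URLs for illustration only: a live OPTIMADE endpoint and
# the corresponding file inside a static archive dump.
LIVE_URL = "https://optimade.example.org/v1/structures"
ARCHIVE_URL = "https://archives.example.org/dump/optimade.example.org/v1/structures.json"

def load_entries(url: str) -> list:
    """Extract the JSON:API `data` array from an OPTIMADE response."""
    with urlopen(url) as response:
        return json.load(response)["data"]

# Because the dump MUST match the format of crawled API responses,
# the parsing code is identical for both sources.
for entry in load_entries(ARCHIVE_URL):
    print(entry["id"], entry["type"])
```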
Potential attributes:
- `time_stamp`/`last_modified`
- `checksum`
- `description`
- `version`
- `size`
- `compression_method`
- `url`
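For concreteness, an entry served from such an /archives endpoint might then look something like the following (all values are invented for illustration; only the attribute names come from the list above):

```json
{
  "id": "dump-2021-06-01",
  "type": "archives",
  "attributes": {
    "last_modified": "2021-06-01T12:00:00Z",
    "checksum": "sha256:...",
    "description": "Full snapshot of optimade.example.org",
    "version": "1.0.0",
    "size": 104857600,
    "compression_method": "gzip",
    "url": "https://archives.example.org/dump-2021-06-01.tar.gz"
  }
}
```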
Issues discussed
- Attribution: the `references` endpoint is naturally included in the dump. Is this enough?
- Licensing: do we need to provide a mechanism for licensing databases differently to filtered data? Do we need to worry about this more generally?
- ACID: should it be an explicit requirement for serving archives?
- Indexing: database indexing is completely lost, alongside any context provided by additional endpoints. This may define the natural dividing line between databases that are archivable and those that are not.
- Implementation overhead: may require extra work to support, but for small databases it should be trivial. There should be no requirement on the frequency of updates.
Enabling new use cases
- For databases that already provide archives in some format:
  - Improved findability, plus the standardization that OPTIMADE brings
  - Could remove some database load, e.g. it could even replace paginating through an "empty" (unfiltered) query to crawl the whole database
- For smaller databases: easier to archive, and easier for the end user to deal with
- For datasets without a live database, e.g. a dataset on figshare: if provided as an OPTIMADE archive, it could be explored with OPTIMADE clients and hybrid local OPTIMADE clients/servers
- Archive-only databases, pointing to persistent long-term storage, indexed in the same way as the providers' repository, e.g. a GitHub repo that builds archives.optimade.org with defined prefixes
Resources
Thanks for the write-up @ml-evs.
> For smaller databases, easier to archive and easier to deal with for the end user
We would still need to make converting someone's data to OPTIMADE format easier. Right now, it would require providers to read the OPTIMADE spec and convert from their current format for structures (POSCAR, CIF, etc.). I think it's worth adding `.to_optimade()` methods to the pymatgen `Structure` and/or ASE `Atoms` classes. That way providers could automate conversion regardless of the initial structure format, and these methods would also let OPTIMADE absorb non-standardized datasets easily.
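No such method exists in pymatgen today, but here is a rough sketch of what it could look like for ordered structures, mapping onto a subset of the OPTIMADE structure attributes (the function itself is hypothetical):

```python
from pymatgen.core import Structure

def to_optimade(structure: Structure) -> dict:
    """Map an ordered pymatgen Structure onto a subset of the OPTIMADE
    structure attributes. Hypothetical sketch, not part of pymatgen."""
    elements = sorted(el.symbol for el in structure.composition.elements)
    site_symbols = [site.species_string for site in structure]
    return {
        "lattice_vectors": structure.lattice.matrix.tolist(),
        "cartesian_site_positions": structure.cart_coords.tolist(),
        "species_at_sites": site_symbols,
        "species": [
            {"name": s, "chemical_symbols": [s], "concentration": [1.0]}
            for s in sorted(set(site_symbols))
        ],
        "elements": elements,
        "nelements": len(elements),
        "nsites": len(structure),
        "dimension_types": [1, 1, 1],
        "nperiodic_dimensions": 3,
        "structure_features": [],
    }

# Usage: attributes = to_optimade(Structure.from_file("POSCAR"))
```

This only covers periodic, ordered structures; disordered occupancies and the various `chemical_formula_*` fields would need more care.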
When we go beyond just structures, though, this could be a lot of work (e.g. even `BandStructure` classes would need a `to_optimade` method)... This ties in with your "Implementation overhead" bullet point.
Attribution & Licensing
This is probably the biggest roadblock to archives. Making it optional should make things a lot easier though. I'd anticipate the larger and more well-known a database is, the less they'll want to participate in this endpoint.
Also, what if we add `license` to your list of attributes? Then unique licensing would be attached to each individual archive dump.
The `url` attribute could also be (optionally) provider-controlled: a CDN with authentication, a link to their own website, etc. This would leave download stats in their hands.
Another route is collecting usage statistics that can be sent back to providers (for them to use in future grant proposals). Users would have to agree to such data collection if they wanted to download an archive. I'm personally against data collection, but it might be a necessary compromise for some providers to participate. This would also have to be implemented in the OPTIMADE client package.
> Could remove some database load
One potential issue is that the OPTIMADE spec doesn't aim to be a condensed format; instead it shoots for being robust, encompassing, and flexible. So we could actually end up with dump files that are larger than the ones providers make themselves. For example, I was able to get all MP structures into a dump file below 100 MB -- but I don't think I can get anywhere close to that using the OPTIMADE spec and JSON format.
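That said, given the `compression_method` attribute above, dumps presumably wouldn't be served as raw JSON anyway, and verbose-but-repetitive JSON compresses well. A toy illustration (the entries are fabricated; real savings will vary by database):

```python
import gzip
import json

# Fabricated, highly repetitive OPTIMADE-style entries, purely to
# illustrate how such JSON responds to compression.
entries = [
    {"id": f"example-{i}", "type": "structures",
     "attributes": {"nsites": 8, "elements": ["O", "Si"], "nelements": 2}}
    for i in range(10_000)
]
raw = json.dumps({"data": entries}).encode("utf-8")
packed = gzip.compress(raw, compresslevel=9)
print(f"raw: {len(raw):,} bytes, gzipped: {len(packed):,} bytes")
```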