V1x and V2x address 2 fundamentally different but both valid forms of annotating data
Description of the issue
As a "legacy user", I consider the allocation of the contextual metadata, i.e. the content of the general keys, one of the most important changes from v1 to v2. First of all, I think it is good that we can now do fine-grained annotation at the resource level. In most applications that should be the way to go.
But I also think it is important to acknowledge that there are data resources that comprise multiple tables or files and only make sense in conjunction. When annotating them, following the v2x approach means a lot of unnecessary repetition of information.
To address this, we need general guidelines on what to do in these cases, or we need to decide to keep maintaining the v1 line.
Ideas of solution
I see several options:
- When a data package has multiple resources that carry the same metadata, the user has to duplicate the metadata for each resource. -> Repetitive and convoluted files.
- ... The user is told to write the metadata only in the first resource to avoid repetition. -> Potentially confusing and non-transparent.
- Point the user to v16+. -> Clean alternative, but it would mean committing to maintaining two metadata formats.
- Extend v20 to optionally allow shared general keys at the top level, as in v1x. -> Larger schema file, potentially confusing.
I tend more towards option 4 because it would mean less overhead in the long run. What are your thoughts?
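To make option 4 more concrete, here is a minimal sketch of how identical resource-level keys could be factored out to the package level. It is only an illustration: the top-level key `shared` and the function name `hoist_shared_keys` are hypothetical and not part of any OEMetadata version, and the example metadata is made up.

```python
def hoist_shared_keys(package: dict, always_local=("name", "path")) -> dict:
    """Move keys whose value is identical in every resource to the package level.

    Keys listed in `always_local` stay on the resource even if they coincide.
    """
    resources = package.get("resources", [])
    if not resources:
        return dict(package)

    first = resources[0]
    # A key is shared if every other resource has it with exactly the same value.
    shared = {
        key: value
        for key, value in first.items()
        if key not in always_local
        and all(key in res and res[key] == value for res in resources[1:])
    }

    slim_resources = [
        {k: v for k, v in res.items() if k not in shared} for res in resources
    ]

    result = {k: v for k, v in package.items() if k != "resources"}
    result["shared"] = shared  # hypothetical top-level key, not in the schema
    result["resources"] = slim_resources
    return result


package = {
    "name": "example_package",
    "resources": [
        {"name": "table_a", "licenses": [{"name": "CC-BY-4.0"}], "keywords": ["wind"]},
        {"name": "table_b", "licenses": [{"name": "CC-BY-4.0"}], "keywords": ["solar"]},
    ],
}
compact = hoist_shared_keys(package)
```

In this example only `licenses` is identical in both resources, so it is the only key that moves to the package level, while `name` and `keywords` stay on each resource.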
Workflow checklist
- [x] I am aware of the workflow in CONTRIBUTING.md
Thank you for this great feedback. This is the most important change in the metadata structure. Here are some comments on your ideas:
- There is a German saying, "Doppelt hält besser" (roughly: "better safe than sorry"). In previous versions, different information could not be added to different resources in a data package at all. You found the disadvantage!
- The design and usability of the OEMetadata Builder is not finished yet. This is an issue we will consider!
- We will definitely deprecate the v160 version. This is not an option.
- Since we want to conform to Data Package, allocating the information at the resource level is necessary. An addition at the dataset (data package) level is not possible.
The structural update in OEMetadata 2.0 creates the possibility to have detailed information for each resource in a data set. This follows the Data Package 2.0 standard. The obvious downside is a possible repetition of information if the resources are similar.
Our goal is to improve the user interface of the OEMetadata Builder on the OEP to handle this. In addition, we are planning to add Python functions to OMI to handle multiple metadata files and data packages.
Your feedback is welcome and we will publish the drafts and ideas online as usual.
4. Since we want to be conform to Data Package, the allocation of the information on the resource level is necessary. An addition on the Dataset (data package) level is not possible.
Just from looking at the schema of Data Package 2.0, there are no restrictions on extending the metadata at the package level. Indeed, some keys are repeated between the package and resource schemas. Luckily, none of these schemas is restrictive, so they are easily extendable. Until now I have been able to validate v160 directly with frictionless by using:
frictionless validate .\some-oemetadata.json --type package --schema https://raw.githubusercontent.com/OpenEnergyPlatform/oemetadata/refs/heads/production/oemetadata/v1/v160/schema.json
There were no compatibility issues, and I imagine the same will hold for v20. The point I want to make is that Data Package is not restrictive at all about what is or is not allowed in it, as long as the content is specified in the schema. Frictionless has become surprisingly flexible, the Open Knowledge Foundation always has someone working on it, and given the size of our community I bet they would be willing to listen if we have suggestions (in case something is really problematic). I have already tried to make them aware that it would be cool if we could build on their existing tools.
PS:
From: https://datapackage.org/standard/data-package/
The Data Package specification does NOT impose any requirements on their form or structure and can therefore be used for packaging any kind of data.
We just discussed advanced functions for the OEMetadata Builder to improve usability for large data packages.
- Implement a "template resource" that will be used by all resources
- Implement a "copy from resource 1" function for each section
This will give complete flexibility and simplify the handling.
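As a rough illustration of the two functions above, here is a minimal sketch in plain Python. The names `apply_template` and `copy_section` are hypothetical and do not exist in the OEMetadata Builder or OMI. The merge is deliberately shallow, which is an assumption: a section defined on the resource replaces the template's section as a whole.

```python
import copy


def apply_template(template: dict, resource: dict) -> dict:
    """Return a new resource: template values first, resource values override."""
    merged = copy.deepcopy(template)
    # Shallow merge (assumption): a resource-level section wins entirely.
    merged.update(copy.deepcopy(resource))
    return merged


def copy_section(resources: list, section: str, source_index: int = 0) -> None:
    """Copy one section from a source resource into all other resources (in place)."""
    source_value = resources[source_index][section]
    for index, resource in enumerate(resources):
        if index != source_index:
            resource[section] = copy.deepcopy(source_value)


template = {
    "licenses": [{"name": "CC-BY-4.0"}],
    "context": {"publisher": "Example Institute"},
}
resources = [
    apply_template(template, {"name": "table_a"}),
    apply_template(template, {"name": "table_b", "context": {"publisher": "Other"}}),
]

# "Copy from resource 1" for a single section:
resources[0]["keywords"] = ["example"]
resources[1]["keywords"] = []
copy_section(resources, "keywords")
```

Deep copies keep the resources independent, so a later edit to one resource does not silently change the template or the other resources.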
A little late to the party, but here is what comes to my mind when reading this issue:
Extending the suggestion from our discussion as posted above: to keep OEMetadata usable outside the OEP, we would need a better solution. Currently, this sounds like it might need a specific profile for each use case, e.g. one profile for datasets that are a set of closely coupled data resources and another for datasets that are a set of various data resources. For simplicity, I'm also in favor of maintaining only a single version until we have more resources to think about larger sets, if ever :) Further down I also describe an idea for adding more automation, but this would create more OEP dependence again.
For more guidance, I guess we would have to formulate some use cases and make clear which tradeoffs are currently involved.
As you already pointed out, we can take inspiration from the specifications we lean on. Looking at the Data Package v2 properties, they also use more properties to describe the dataset. We would have to add a few more options to our OEMetadata schema to allow more properties. The DCAT-AP dataset class lists only title and description as mandatory properties, while offering a few recommended and more optional ones. DCAT also reduces redundant keywords, but only by maintaining a more complex object-based relational model. OEMetadata is starting to become more modular but is not there yet. Additionally, we have not yet properly implemented our fine-grained badge system to make the importance of each property more obvious.
As it seems excessive to require purely hand-filled metadata, we can help with automation, at least when metadata is used on the OEP. We would have to add OEP-custom properties to the dataset level of OEMetadata. And maybe add the subject annotation property to the dataset level?
The properties generated automatically from the information available in each resource could be:
- Last Updated:date Any edit action on the dataset or any resource
- Publisher:list All publishers from each resource?
- Project:list All project names from each resource
- Licenses:list As we know the license for each resource of the dataset, we can say it uses open licenses only if all licenses evaluate as open (using the SPDX open data license list or the enhanced version that also includes e.g. EU licenses). If some licenses are missing, or if resources do not hold a license, this affects the openness of the dataset. On the dataset level we would show License -> show all licenses. The options are: open, leaning towards open, data access restricted, and unclear as fallback.
- Keywords:list Could also come from all resources. Nevertheless, maybe it would be better to add only a few more general ones or even use specific user choices instead.
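A minimal sketch of how such dataset-level properties could be derived from the resources is shown below. Several things here are assumptions for illustration only. The key names (`licenses[].name`, `context.publisher`, `context.title` for the project name, `keywords`) may not match the schema. `OPEN_LICENSES` is a small placeholder set, not the SPDX-based list. The exact mapping onto the four openness options is my own guess, since the thread does not define it. "Last Updated" is left out because it depends on edit events rather than on the metadata content.

```python
# Placeholder subset of SPDX identifiers treated as open (assumption).
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "ODbL-1.0", "PDDL-1.0"}


def unique(values) -> list:
    """Collect non-empty values once each, keeping first-seen order."""
    seen = []
    for value in values:
        if value and value not in seen:
            seen.append(value)
    return seen


def license_names(resource: dict) -> list:
    return [lic["name"] for lic in resource.get("licenses", []) if lic.get("name")]


def classify_openness(resources: list) -> str:
    """Assumed mapping onto the four options named in the list above."""
    per_resource = [license_names(res) for res in resources]
    all_names = [name for names in per_resource for name in names]
    if not all_names:
        return "unclear"  # no license information at all
    if any(name not in OPEN_LICENSES for name in all_names):
        return "data access restricted"  # at least one non-open license
    if all(per_resource):
        return "open"  # every resource has only open licenses
    return "leaning towards open"  # open so far, but some licenses are missing


def summarize_dataset(resources: list) -> dict:
    return {
        "publishers": unique(r.get("context", {}).get("publisher") for r in resources),
        "projects": unique(r.get("context", {}).get("title") for r in resources),
        "licenses": unique(n for r in resources for n in license_names(r)),
        "openness": classify_openness(resources),
        "keywords": unique(k for r in resources for k in r.get("keywords", [])),
    }


example_resources = [
    {"licenses": [{"name": "CC-BY-4.0"}],
     "context": {"publisher": "A", "title": "P1"}, "keywords": ["wind", "energy"]},
    {"licenses": [{"name": "ODbL-1.0"}],
     "context": {"publisher": "A", "title": "P2"}, "keywords": ["energy"]},
]
summary = summarize_dataset(example_resources)
```

Whether keywords should really be the union of all resource keywords, or a smaller user-chosen set, is left open here just as in the list above.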
I thought about this. I think the best way forward would be to make everything more modular. Frictionless also splits its spec into multiple parts: Package, Resource, Schema, etc. I'm not sure whether we have to provide specific JSON schemas for each module or add the profile property to become an official extension of the frictionless Data Package spec.
Maintaining both v2 and v16 does not seem like the best option.
Additionally, the OEP also needs modular metadata. For example, when someone uploads new data and later adds metadata, they currently attach the full dataset-focused v2 metadata to a single table, which is not ideal. We handle this internally, yet it would be cleaner to have only the resource information for a single table.