4499 - dataset version pids
What this PR does / why we need it:
This pull request adds the long awaited option to generate PIDs for dataset version.
TODOs:
- [ ] Add tests for JsonPrinter/JsonParser of Collection JSON
- [x] Implement API endpoints to set collection conduct (might need new engine command)
- [ ] Add tests for
DataverseServiceBean.wantsDatasetVersionPids() - [ ] Add
GlobalIdServiceBean.publicizeIdentifier(DatasetVersion datasetVersion)and implementations- [ ] DataCite (also needs metadata template adaption)
- [ ] PermaLink
- [x] Fake (stub only)
- [ ] Handle: leave unsupported for now
- [ ] EZID: leave unsupported for now
- [x] Add identifier generation for dataset versions (add generator to interface
GlobalIdServiceBeanand impl. classes) - [x] Extend DatasetVersion model to include the identifier (so we can generate random ones as extension)
- [x] ~~Think about pre-registring the version PID like we do for datasets/files~~ No - this doesn't make sense, as we never know a-priori if a version will be minor or major, and only major versions shall have a PID.
- Extend with creating and publicizing the identifiers:
- [ ]
UpdateDvObjectPIDMetadataCommand - [ ]
RegisterDvObjectCommand - [x]
FinalizeDatasetPublicationCommand
- [ ]
- [ ] Extend to modify existing version records with updates for minors (e.g. in
UpdateDatasetTargetURLCommand) - [ ] Extend Version UI to show the PIDs as links
- [ ] Extend Data Citation UI component to include the version DOI (Configuration?)
- [ ] Write out the version PID in Dataset JSON parts
- [ ] Parse the version PID in the JSON Parser for Datasets
- [ ] Adapt
DatasetVersion.getJsonLd()to indicate this is a version, use version identifier and add relation to dataset via concept PID - [ ] Extend Export API to accept PIDs of versions & export accordingly
- [ ] Write API tests for new endpoints
- [ ] Write tests to make sure this whole thing works as designed (it needs adaption of current tests as well)
- [ ] Docs for the API calls
- [ ] Docs for the extension and implied default in the Collection JSON / JPA model
- [ ] Docs for the configuration and default conduct
Which issue(s) this PR closes:
Closes #4499
Special notes for your reviewer: This is WIP.
Suggestions on how to test this: This is WIP.
Does this PR introduce a user interface change? If mockups are available, please link/include them here: Some minor additions. Will be included once avail.
Is there a release notes update needed for this change?: Yes, will be included.
Additional documentation: None yet.
Hi @mreekie, this is not yet ready to be drawn into any other column other than my own. I did add a size, but obviously this is not the size of the estimated work for me, but my expectation of time spent on review, testing and Q&A at IQSS. Hope that's alright!
coverage: 20.399% (+0.07%) from 20.332% when pulling fa8d544e86d8ccf314d0695f177153248765033a on poikilotherm:4499-version-doi into fd190a39d4d89051072d9788181a2a7763f2af69 on IQSS:develop.
Key takeaways from first feedback during tech hour:
- Put the business logic for no PID / major version only / major + minor into the decision in
wantVersionPid()and make it a three way thing. Don't put the decision into the PID provider. - Think about overrides: current design says collection decides. The storage option for example can be overridden per dataset via API. Do we need this here?
- Tap into moving a dataset to new owner: give warning / abort / react to the new owner not having versions enabled when the dataset has identifiers for versions (which would be very quite trap)
- Give some thought into possibilities of "going back" or at least send warnings if for example the version was publicized but the dataset was not, which creates inconsistencies.
- Let the idea of "make this PR small scoped and tucked away behind feature flag" instead of "create a one big boom PR madness" sink in and discuss again later.
Thanks @qqmyers @pdurbin and everyone else for commenting and sharing your thoughts!
After an elaborate chat with @landreev yesterday (thanks again!), let me write down our key insights we got out of it:
- Make the JVM setting
dataverse.pid.version.modethree way:off(default),allow-major,allow-minor. That way, a sysadmin can restrict choice in collections, as this is cost related. - Enable collection curators to choose for a collection:
inherit(default),skip,major,minor. This way they inherit from parent collection, have an opt-out (skip) and can choose between major only or also create for minors. The selection will be limited by the global setting. - We'll change the order of creating and publicizing version PIDs to reduce the risks of incoherence at DataCite and database.
- We will create the PID after all file PIDs are publicized but before we publicize the dataset PID. This might be a long list and the chance of things going sideways is highest there.
- We will publicize the version PID after the dataset PID. If that update goes sideways, the version PID is registered, but not yet charged for and can be reused on next publication attempt. We can already include the version PID in the dataset PID metadata because we created it before (see 1).
- There is a tricky edge case. Assume version PIDs to be a suffix of the dataset PID. If someone tries to publish and says "major update" and things go sideways before the version PID is done but after the dataset PID is publicized, on the next publication attempt they might choose "minor update". As we did not yet publicize the major one, we can create a minor one and go with it. The major one can be reused later (it's not findable/resolvable anyway). The business logic to create an identifier for a version using suffixes needs to be OK with a PID already existing. (This is different for datasets, where we need to create a new identifier!!!)
- We were in mutual agreement that we should make this PR smaller scoped. This PR should only be about the "inner core", enabling the general concept to be realized. It should not yet touch the UI, unnecessary API changes or PID provider support. This way, reviewing and quality assurance are much easier to handle without endangering a release hitting the public. Any documentation included in this PR should state clearly this is not yet ready and should only be used in development and experimentation.
A couple minor points:
- you say 'off' in 1, and 'skip' in 2 - those are the same thing?
- ~3.2, 3.3 - PIDs are only non-deletable once they are publicized (I think that's true for all current PID providers) so I don't think there's ever a need to keep an un-published PID around after a failure, i.e. those can just be deleted, and new ones or the same PID recreated later if/when needed. One would have to make those calls to delete as part of a rollback action. Version PIDs are definitely a new case as the major/minor status and whether those categories get PIDs can change.
Thanks for continuing the feedback @qqmyers!
Ad 1)
Yes, on purpose. The selection in the collection will be disabled if the feature is turned off globally (by an admin). No one can turn it on by accident.
"Skip" on the other hand is an opt-out. If your parent collection has it on, you might want to turn it off for a certain subcollection.
The decision on the collection is by curators! This is a difference to selecting storage per collection.
Ad 3.3)
Yes, one could try to make that happen in rollback. Let me add a note here that we don't do this for dataset PIDs or File PIDs, so maybe it's fine not to do it for version as well.
Also, in case things go sideways because the provider is out shopping 500s, it might be hard to ask for a deletion.
Of course, we could add a service that retries deleting failed registrations. Might be neat for dataset PIDs and file PIDs as well. But sounds like beyond scope for this PR to me.
re: 1) I'm still a bit confused - I understand off/no version PIDs allowed as a setting (though just not having setting defined could be 'off'), but does the setting then enforce whether major/minor pids are allowed at all? If not, the setting can just be binary. If it does, the the curator option either shouldn't allow minor if minor isn't turned on, etc. re 3: I agree rollback could fail. My point was not to add scope though, Instead, I thought the difference from dataset and file pids could help you avoid having to deal with the version corner cases. Removing reserved PIDs is done when the dataset/file is deleted and isn't needed in a rollback because they don't go away if publish fails, whereas, as you point out, the version as a fixed vX.Y does go away if publication fails (and the maleable draft version remains). Because of all that, I think if you basically remove the local PID entry for a draft version in rollback (and try to tell the PID provider), you can avoid having to track it - just regenerate a new one when you publish again. (If the PID provider call fails, there would still be an abandoned registered PID that could/should be removed someday). Hopefully that would save you time/effort relative to handling the corner cases.
@poikilotherm this PR is on your wishlist for 6.4 but there are a lot of merge conflicts.
Also, do you still feel that a size of 30 is right?
Thanks.
We are also interested in this feature. I saw above this feature was (potentially) planned for 6.4. Is it now planned for a later version?
2025-03-31
- @poikilotherm are you still working on this draft? Please let us know what we should do with it. Thanks!