metacatui
Allow users to reserve a DOI
Currently, users must rely on R or Python to reserve a DOI, but there’s been a request to add this capability directly in the UI. ESS-DIVE already offers this feature, so we should look into how they implemented it. This feature would be valuable for the Smithsonian community.
Considerations
- Include a repository-level setting to control who can reserve DOIs.
- Design how reserved DOIs can be applied to published datasets.
I think this is an excellent feature, and one useful to many repositories. The key infrastructure is already in place, and the main things we'd need to do are to:
- determine where to store the reserved identifier for later use
- I note that the CN.reserveIdentifier() and CN.hasReservation() methods are likely already implementing this kind of functionality on the CN side, which could be ported into the MN implementation in Metacat.
- determine who is able to use that identifier in a `create()` or `update()` call to Metacat. Here are some possible pathways:
- user reserves `doi:10.0000/yyyy` for dataset with `PID-1`; same user comes back later and clicks 'publish', and the reserved DOI is used in an `update()` on the most recent PID in that version chain
- user reserves `doi:10.0000/yyyy` for dataset with `PID-1`; same user comes back later and clicks **'request publish'**, which sends a message to the curation team; the curation team can then click 'publish' with the reserved DOI, which is used in an `update()` on the most recent PID in that version chain
- user reserves
Who can do what should probably be configurable on a per-repository basis.
This is all related to the lifecycle workflow we discussed in #2205 -- here's a modified version of that workflow that might include some of these options:
From https://github.com/NCEAS/metacatui/issues/2205#issuecomment-1768409388
stateDiagram-v2
[*] --> Draft: New
Draft --> DOIReserved: reserveIdentifier
DOIReserved --> Draft
Draft --> Draft: Save
Draft --> [*]: Delete
review : In review
rev_request: Review requested
Draft --> rev_request: Review
rev_request --> review: StartReview
review --> Draft: RequestRevision
review --> Approved: Approve
Approved --> Published: Publish
Published --> Draft: Edit
Published --> [*]
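For anyone who wants to reason about the lifecycle programmatically, here is a minimal Python sketch of the diagram above as a transition table. The names `START`, `END`, `ReviewRequested`, and `InReview` are my own stand-ins for the diagram's `[*]`, `rev_request`, and `review` nodes, and the empty string marks the two transitions that have no label in the diagram. It is only an illustration, not how MetacatUI or Metacat models state.

```python
# Transition table transcribed from the state diagram above.
# "START"/"END" stand in for the diagram's [*] pseudo-states (assumed names).
# An empty action string marks a transition that is unlabeled in the diagram.
TRANSITIONS = {
    ("START", "New"): "Draft",
    ("Draft", "reserveIdentifier"): "DOIReserved",
    ("DOIReserved", ""): "Draft",
    ("Draft", "Save"): "Draft",
    ("Draft", "Delete"): "END",
    ("Draft", "Review"): "ReviewRequested",
    ("ReviewRequested", "StartReview"): "InReview",
    ("InReview", "RequestRevision"): "Draft",
    ("InReview", "Approve"): "Approved",
    ("Approved", "Publish"): "Published",
    ("Published", "Edit"): "Draft",
    ("Published", ""): "END",
}


def next_state(state, action=""):
    """Return the state reached from `state` via `action`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"action {action!r} is not allowed in state {state!r}"
        ) from None


def allowed_actions(state):
    """List the actions the diagram permits from `state`."""
    return sorted(action for (source, action) in TRANSITIONS if source == state)
```

For example, `next_state("Draft", "reserveIdentifier")` gives `"DOIReserved"`, while `next_state("Draft", "Publish")` raises `ValueError` because the diagram only allows publishing from `Approved`.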
@robyngit @mbjones
We use the MN /v2/generate endpoint to obtain a DOI from the configured Metacat DOI Service. For cases where a seriesId is used, the process is straightforward: the DOI returned can be directly associated with the dataset. Once the dataset is ready for publication, the MN /v2/publishIdentifier endpoint is then used to publish the dataset.
However, for the use case where a dataset is "re-published" with a new DOI (as is the default publication workflow in MetacatUI), the question arises: where do we store the DOI for later publication using /v2/publishIdentifier?
We are grappling with this issue on WFSI (wfsi-data.org). This challenge is particularly significant because we face difficulties managing datasets that have relationships with other datasets. For example, we link datasets using DataCite-related identifiers via EML dataset annotations. These datasets are typically part of the same research campaigns and are sometimes produced simultaneously.
Key Challenges:
- Order of Publication: We must carefully manage the publication order to avoid mistakes that necessitate minting a new DOI.
- Tracking Relationships: Managing relationships between datasets becomes complex, especially when multiple DOIs are involved throughout the dataset's lifecycle.
Example of Related Identifier Annotation
Proposed Solution: Pre-Minting DOIs
Pre-minting DOIs could help address these challenges. One idea is to use the <alternateIdentifier> element to store pre-minted DOIs, for example:
<alternateIdentifier>https://doi.org/10.XXX/1232424</alternateIdentifier>
We already apply a similar approach on ESS-DIVE when a user provides an existing DOI for their dataset that they want us to use instead of minting a new one via ESS-DIVE. However, this approach does not align well with a publication model that does not use seriesId, as a dataset can have multiple DOIs throughout its lifecycle.
Additional Considerations:
- Finding All DOIs for a Dataset: There is currently no straightforward way to track all DOIs minted for a particular dataset. A metadata-based solution could potentially address this.
- Tracking Dataset Changes: On WFSI, we use a `CHANGES.md` file for data contributors to document changes to different dataset versions. Perhaps a similar mechanism could be introduced in EML to help track such changes.
Would love to hear your thoughts on this!
@vchendrix let's discuss this with @taojing2002, but quickly addressing your question:
> However, for the use case where a dataset is "re-published" with a new DOI (as is the default publication workflow in MetacatUI), the question arises: where do we store the DOI for later publication using /v2/publishIdentifier?

I think that `publishIdentifier` is called anytime that a `create` or `update` call is made on the API for a PID or SID that is a DOI in the configured shoulder list for the Metacat instance. MetacatUI is not involved as far as I know, and so this ticket likely belongs in the Metacat repo, unless I am not understanding its scope. I don't understand the benefits of storing a pre-minted DOI in the alternateIdentifier field, as that would just confuse search as to which version is actually associated with the DOI. For example, imagine:
- User calls `create` with PID `A`, and with `packageId` set to `A` and `alternateIdentifier` set to `doi:10.xxxx/foo` (at this point, the DOI will not resolve in DataCite/OSTI)
- Indexing detects the alternateIdentifier and associates version `A` with `doi:10.xxxx/foo` for search
- User now calls `update` with PID `doi:10.xxxx/foo`, which publishes the DOI, and the DOI now resolves
- Indexing detects the new DOI pid in `packageId` and associates identifier `doi:10.xxxx/foo` for search on this version (note the DOI is now tied to two different versions of the object)
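To make the scenario concrete, here is a toy Python sketch of it. The `index_versions` function is a deliberately naive stand-in for an indexer that treats any DOI in `packageId` or `alternateIdentifier` as searchable; it is not the actual Metacat indexing logic, and the record layout is assumed purely for illustration.

```python
from collections import defaultdict


def index_versions(versions):
    """Map each DOI to the PIDs a naive indexer would associate with it.

    Assumption: the indexer looks at both `packageId` and
    `alternateIdentifier` and indexes any value that starts with "doi:".
    """
    doi_to_pids = defaultdict(list)
    for version in versions:
        for field in ("packageId", "alternateIdentifier"):
            value = version.get(field)
            if value and value.startswith("doi:"):
                if version["pid"] not in doi_to_pids[value]:
                    doi_to_pids[value].append(version["pid"])
    return dict(doi_to_pids)


# The two versions from the steps above.
versions = [
    # create: PID "A" carries the pre-minted DOI as an alternateIdentifier
    {"pid": "A", "packageId": "A", "alternateIdentifier": "doi:10.xxxx/foo"},
    # update: the new version uses the DOI itself as its PID and packageId
    {"pid": "doi:10.xxxx/foo", "packageId": "doi:10.xxxx/foo",
     "alternateIdentifier": None},
]

doi_index = index_versions(versions)
```

With these two records, `doi_index["doi:10.xxxx/foo"]` contains both `"A"` and `"doi:10.xxxx/foo"`, which is the ambiguity described in the last step.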
Hey @mbjones, given CN.reserveIdentifier() and CN.hasReservation(), I don't really understand how the reserved DOIs are associated with an unpublished dataset. We need a way to communicate to the data contributors that the dataset has a reserved identifier. I was thinking that maybe it should live somewhere in the metadata of the dataset. We need this so that users who have access to a pre-published dataset are able to access the pre-minted DOI for various reasons. Specifically for us, it is to link datasets by related identifiers.
Pre-mint DOI with seriesId
The process we use in ESS-DIVE (using seriesId) for pre-minting DOIs:
- pre-mint doi: /v2/generate - returns a doi
- set doi as series id by updating the system metadata
- publish dataset using /v2/publishIdentifier
Pre-mint DOI
The process I was thinking about for WFSI for pre-minted DOIs (currently we use Publish with DOI in the UI). I have tested this approach and it seems to work well:
- pre-mint doi: /v2/generate - returns a doi
- Manually save the association of the new DOI to the dataset (in a spreadsheet)
- Communicate the pre-minted DOI to data contributors who need to link their dataset
- When ready to publish, update dataset with the doi
- invoke /v2/publishIdentifier (this seems to make the dataset and associated files public if they are not already; I could be wrong)

It would be nice if steps 2 and 3 could be automated.
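As a rough illustration of how the bookkeeping in step 2 might be automated, here is a small Python sketch. It only builds URLs and writes a CSV row; it sends no requests. The base URL is a placeholder, the assumption that `publishIdentifier` takes the URL-encoded identifier as a path segment should be checked against the Metacat API docs, and the helper names are hypothetical.

```python
import csv
import io
from urllib.parse import quote


def generate_url(base_url):
    """URL of the MN generate endpoint used to pre-mint a DOI."""
    return f"{base_url.rstrip('/')}/v2/generate"


def publish_identifier_url(base_url, doi):
    """URL for publishing a previously minted DOI.

    Assumption: the identifier is passed as a URL-encoded path segment.
    """
    return f"{base_url.rstrip('/')}/v2/publishIdentifier/{quote(doi, safe='')}"


def record_reservation(out, pid, doi):
    """Append a (dataset PID, reserved DOI) row, replacing the manual spreadsheet."""
    csv.writer(out).writerow([pid, doi])


# Hypothetical values for illustration only.
BASE_URL = "https://example.org/metacat/d1/mn"
log = io.StringIO()
record_reservation(log, "PID-1", "doi:10.0000/yyyy")
publish_url = publish_identifier_url(BASE_URL, "doi:10.0000/yyyy")
```

In real use `log` would be a file opened in append mode, and the DOI would come from the response to a request against `generate_url(BASE_URL)`.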
Please correct me where I misunderstand! Thanks.
Let's discuss tomorrow, but yes, we do largely the same process you describe. There are some nuances about releasing pre-minted DOIs, and how publicly those should be distributed. Let's discuss a plan forward.
FYI -- the CN.reserveIdentifier() and CN.hasReservation() are CN operations and not really related to MN operations for generating identifiers. The two CN ops let a MN "reserve" an identifier so that there won't be a conflict with other MNs - identifiers are assigned on a first-come, first served basis as we harvest content at the CNs. Identifiers are just strings, and there is nothing technically preventing conflicts across MNs. Those methods have largely proven unnecessary because MNs pretty much operate in their own namespaces, with the exception of MNs that redistribute content for other organizations, which is a potential source of conflict. Let's chat.
What is the reason users need another way to publish the package? Is it because they don't want the package to be publicly readable even though it has been assigned a DOI? If that is the case, we can leverage the guid.doi.autoPublish property. If the value is false, the package will remain private even after it is assigned a DOI. When the package is ready, users can publish it by calling the publishIdentifier API, a feature that MetacatUI currently doesn't have.
Also, if users really want to reserve a DOI, we can just provide a simple UI: a reserve button that returns a DOI (maybe only admins are allowed to use it?). Users can then use the DOI however they want. We don't need MetacatUI to associate the DOI with any package, which keeps things easy.
@taojing2002 If I set guid.doi.autoPublish=false and request a bunch of DOIs for private datasets (in DRP for example) will EZID report those links as broken until they are given public read? I expect this will be a very heavily used feature for Smithsonian and DRP, and possibly other HRs.
I am not sure if this is the place to mention this but the reason most ESS-DIVE data contributors want to get a preminted doi is because they need to provide it for a publication in review.
One other thing that publishers want is to get anonymous access for reviewers to a dataset (without making it public) in preparation which we cannot do right now. Do you have any thoughts on this?
@vchendrix, @iannesbitt and I chatted about how this could be implemented, and couldn't come up with a design that would give anonymous, read-only access without exposing reviewer identities or the dataset. I assume some sort of "anyone with the link can view" feature is what you're imagining? Something like this would be really useful, but it would require changes on both the Metacat and MetacatUI sides. The only work around I can think of that would be usable right now would be to 1. create a temporary account, 2. give that account view access to the dataset, and 3. share the login with the reviewers. Later the account could be removed. Open to other ideas & suggestions! We should create a new issue if it's worthwhile to discuss further.
> @vchendrix, @iannesbitt and I chatted about how this could be implemented, and couldn't come up with a design that would give anonymous, read-only access without exposing reviewer identities or the dataset. I assume some sort of "anyone with the link can view" feature is what you're imagining?

Yes, I think that is what I imagine.
> Something like this would be really useful, but it would require changes on both the Metacat and MetacatUI sides. The only work around I can think of that would be usable right now would be to 1. create a temporary account, 2. give that account view access to the dataset, and 3. share the login with the reviewers. Later the account could be removed. Open to other ideas & suggestions! We should create a new issue if it's worthwhile to discuss further.

That could work too. On ESS-DIVE we will probably create secondary storage (outside of Metacat) where the reviewers can access the dataset data.
Ironically, the "Anyone with the link can view" feature was something that much earlier versions of Metacat had through limited-time tokens that expired at a set time period. That feature was around for about the first decade of Metacat, nobody used it, and so we removed it (maybe in Metacat 2.0 but not sure) because it complicated all of our access control checking code. This feature (access via secret link) is fairly hard to implement, as it requires all involved API endpoints to handle authorization requests differently from how they do now. I can imagine a pathway for this but things get fairly complicated quickly... ACLs are very low level. For example, we would need both Metacat and the SOLR auth hook to honor the solution.
This ticket has gotten pretty broad and has drifted fairly far from the original request, which was to create a UI component to allow users to pre-reserve a DOI which will be assigned later when the dataset is published. That is much more tractable, and does not require changes to our underlying authorization system. I suggest we stick to that in this ticket, and open another ticket for other features that were discussed here.
Makes sense, @mbjones. I appreciate knowing the history of this topic.
In our DataONE team meeting today, we discussed various approaches to implementing a reserve DOI feature. We considered adding a new method to the MN API to list reserved DOIs for a user, which would allow users to retrieve all their reserved DOIs. However, this still lacks a way to associate these reserved DOIs with specific datasets.
Rather than storing the full reserved DOI in the alternateIdentifier field, we propose storing the DOI minus the prefix in the resource map. This would allow us to associate the reserved DOI with a dataset without causing confusion in search results. MetacatUI can add the prefix to create the full DOI when the dataset is published. One advantage of using the resource map for DOI storage is that it allows us to use the same approach for datasets that use metadata other than EML. If we have the DOI stored in the resource map, no changes are needed to Metacat.
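A minimal Python sketch of the split and rejoin this implies is below. It assumes "prefix" means everything before the first `/` (for example `doi:10.0000`) and that identifiers use the `doi:` form seen elsewhere in this thread; both the interpretation and the function names are illustrative, not an agreed design.

```python
def split_doi(doi):
    """Split 'doi:10.0000/yyyy' into its prefix and suffix.

    Assumption: the prefix is everything before the first '/'.
    """
    prefix, sep, suffix = doi.partition("/")
    if not sep or not suffix or not prefix.startswith("doi:10."):
        raise ValueError(f"not a DOI of the form doi:10.NNNN/suffix: {doi!r}")
    return prefix, suffix


def full_doi(prefix, suffix):
    """Rebuild the full DOI at publish time from the stored suffix."""
    return f"{prefix}/{suffix}"


# Only the suffix would be stored in the resource map; the prefix would
# come from repository configuration when the dataset is published.
prefix, stored_suffix = split_doi("doi:10.0000/yyyy")
rebuilt = full_doi(prefix, stored_suffix)
```

Splitting on the first `/` keeps suffixes that themselves contain slashes intact, so the round trip returns the original identifier.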
Questions left to answer:
- How will we define the relationship between the reserved DOI and the dataset in the resource map? Do we need a custom term, e.g. `dataone:reservedDOISuffix`? Or is there an existing term we can use?
- What will the UI look like for a. users to reserve a DOI? b. users to apply a reserved DOI to a dataset & publish it?
Related: https://github.com/NCEAS/metacatui/issues/1380, https://github.com/NCEAS/metacatui/issues/2205