
Add distribution URL to dataset

Open jeanetteclark opened this issue 5 years ago • 16 comments

Describe the feature you'd like

I would like for metacatUI to automatically add a distribution URL to the dataset element.

Is your feature request related to a problem? Please describe.

This is related to a check in the metadig FAIR suite, which looks for whether a resource landing page is present. The check looks at this XPath: /eml/dataset/distribution/online/url[@function="information"].

Additional context

Here is an example:

    <distribution>
      <online>
        <url function="information">http://test.arcticdata.io/view/urn:uuid:026ea3ef-809c-4c4b-9c38-1c7587641a6f</url>
      </online>
    </distribution>

The base URL should be the view service for whatever member node is being used, or if the dataset has a DOI, it should use doi.org.
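
A minimal sketch of that rule, for illustration only. The function name `buildDistributionUrl`, the `viewBaseUrl` parameter, and the assumption that DOI identifiers start with a `doi:` prefix are hypothetical and not taken from MetacatUI. The PID is left unencoded so the result matches the example above.

```typescript
// Hypothetical helper (not MetacatUI code): choose the landing-page URL for a dataset.
// Assumes DOI identifiers carry a "doi:" prefix and that `viewBaseUrl` is the
// member node's view service root, e.g. "http://test.arcticdata.io/view".
function buildDistributionUrl(pid: string, viewBaseUrl: string): string {
  const doiPrefix = "doi:";
  if (pid.toLowerCase().indexOf(doiPrefix) === 0) {
    // Datasets with a DOI point at doi.org rather than the member node.
    return "https://doi.org/" + pid.slice(doiPrefix.length);
  }
  // Otherwise use the view service; drop trailing slashes to avoid "//".
  const base = viewBaseUrl.replace(/\/+$/, "");
  return base + "/" + pid;
}
```

Whether the PID should be URL-encoded in the stored URL is a separate decision that this sketch does not make.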

jeanetteclark avatar May 08 '20 22:05 jeanetteclark

If we really wanted to be thorough, we could also add a url with the attribute download pointing to the coordinating node URL for the object itself.

jeanetteclark avatar May 08 '20 22:05 jeanetteclark

Bump. This should be a priority, @mbjones can follow up with other thoughts

jeanetteclark avatar Jan 05 '23 19:01 jeanetteclark

The distribution URL should be the view URL if the dataset has a UUID, otherwise the DOI URL if the dataset has a DOI.

robyngit avatar Jan 12 '23 22:01 robyngit

Update:

The soon-to-be-released develop branch has a number of updates that support this feature, including:

  • Updates to the EMLDistribution model:
    • implemented methods to parse distribution EML & update distribution DOM
    • support for the url & urlFunction elements
    • add documentation & unit tests
  • EML model now serializes distribution element during serialization (previously it was skipped, such that no elements were updated)
  • add/move methods to handle DOIs to the appModel (we need these methods to create distribution URLs when the dataset has a DOI.)

Further work on this feature continues in the feature-1380-auto-add-dist-url branch. See https://github.com/NCEAS/metacatui/commit/b80f8e05bc9666d29a6e320d39333d15b82d847b.

Remaining work to be done:

  • [x] Test changes made in https://github.com/NCEAS/metacatui/commit/b80f8e05bc9666d29a6e320d39333d15b82d847b (test that this automatically adds or updates urls to match the new PID during save.)
  • [ ] Enable adding/updating the url when the Publish with DOI button is pressed

The second task will involve changes to the MetadataView.publish method. Instead of using the DataONE publish API, we will need to first generate the DOI, then update the EML with the new <distribution> URL, then save the new record. IDs can be generated using baseURL/generate/, see the R package for an example.

I am pushing this feature from the upcoming release to the next one for now.

robyngit avatar Aug 01 '23 21:08 robyngit

Here is what happens in the Editor with the updates currently in the feature-1380-auto-add-dist-url branch:

  1. Before the EML gets serialized, check for the autoAddDistributionURL option in the app config.
  2. Continue to serialize (step 5) if false, otherwise:
  3. Remove old distribution URLs. A <distribution> element is considered an old distribution URL if ALL the following are true:
    • it has an <online> child element
    • the <online> has a child <url> element
    • the <url> element has a function attribute set to "information"
    • the <url> value contains the dataset's new PID, old PID, or seriesId, whether it is URL-encoded or not.
  4. Add a new distribution URL in the format: <distribution><online><url function="information">{DOI or VIEW URL}</url></online></distribution>
  5. EML proceeds to be serialized as normal.
flowchart TD
    A[press SAVE button]
    C(`autoAddDistributionURL`?)
    D[Remove old distribution URLs]
    F[Add new distribution URL]
    G[EML is serialized as normal]

    subgraph "For Each <distribution> Node"
        Z[start checking node]
        Y(Has < online> child?)
        X(Has child < url> element?)
        W(< url> has 'function=information'?)
        V(< url> contains new PID, old PID, or seriesId?)
        U[Remove it]
        T[Keep it]
        S[done checking node]

        Z --> Y
        Y -- Yes --> X
        X -- Yes --> W
        W -- Yes --> V
        V -- Yes --> U
        Y -- No --> T
        X -- No --> T
        W -- No --> T
        V -- No --> T
        S -. next node .-> Z
    end

    A --> C
    C -- Yes --> D
    D --> Z
    F --> G
    C -- No --> G
    U --> S
    T --> S
    S --> F
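
The per-node check in the flowchart can be sketched roughly as follows. This is an illustration over plain objects, not the EMLDistribution model from the branch: the `DistEntry` type and both function names are hypothetical stand-ins for a parsed <distribution> element.

```typescript
// Simplified stand-in for a parsed <distribution> element (hypothetical shape).
type DistEntry = {
  hasOnline: boolean;          // has an <online> child
  url: string | null;          // text of <online><url>, or null when absent
  urlFunction: string | null;  // value of the url's function attribute
};

// True only when ALL four conditions hold: <online> child, <url> child,
// function="information", and the URL contains one of the dataset's ids
// (new PID, old PID, or seriesId), either raw or URL-encoded.
function isOldDistributionUrl(entry: DistEntry, ids: string[]): boolean {
  if (!entry.hasOnline || entry.url === null) return false;
  if (entry.urlFunction !== "information") return false;
  const url: string = entry.url;
  return ids.some(function (id) {
    if (id.length === 0) return false;
    return url.indexOf(id) !== -1 || url.indexOf(encodeURIComponent(id)) !== -1;
  });
}

// Remove the old distribution URLs, keep everything else, then add the new one.
function replaceDistributionUrl(entries: DistEntry[], ids: string[], newUrl: string): DistEntry[] {
  const kept = entries.filter(function (e) {
    return !isOldDistributionUrl(e, ids);
  });
  kept.push({ hasOnline: true, url: newUrl, urlFunction: "information" });
  return kept;
}
```

Entries that fail any one of the conditions (for example, a url with a different function value) are kept untouched, matching the "Keep it" branches.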

robyngit avatar Aug 16 '23 23:08 robyngit

I've been working on how the publish DOI button will need to work in order to keep the online distribution information up-to-date. The notes below show my preliminary ideas... Feedback is very welcome!

How the Publish With DOI button works now:

  • When the button is clicked, a request is sent to the /publish end point. The request includes the PID of the current EML document.
  • If successful, the response will include the new DOI for the EML doc. The resource map gets updated on the backend.
  • The view redirects to the new view URL with the DOI

Proposed new behaviour for the Publish With DOI button:

  • The Publish button is clicked.
  • A new DOI is generated with the /generate endpoint. No PID is included in the request; it's just a reserved DOI at this point.
  • The EML doc is downloaded, parsed, and updated: <distribution> elements that give the online distribution url of the dataset are updated with the new DOI url.
  • The resource map is downloaded, parsed, and relationships updated from the old EML PID to the new DOI.
  • The EML document is saved to the server with the new DOI.
  • The resource map is saved to the server with a new PID.
  • Redirect to the new view URL with the DOI.
sequenceDiagram
    participant U as User (Browser)
    participant S as DataONE (Server)

    U->>U: Click Publish button
    activate U
    U->>S: Request new DOI via /generate/ endpoint
    deactivate U
    activate S
    S-->>U: Return new reserved DOI
    activate U
    deactivate S
    U->>S: Request EML doc
    deactivate U
    activate S
    S-->>U: Send EML doc
    activate U
    deactivate S
    U->>S: Request resource map
    deactivate U
    activate S
    S-->>U: Send resource map
    activate U
    deactivate S
    U->>U: Parse EML & resource map
    U->>U: Update EML doc with new DOI
    U->>U: Update resource map with DOI
    U->>S: Save EML with new DOI
    deactivate U
    activate S
    S-->>U: Success
    activate U
    deactivate S
    U->>S: Save resource map with new PID
    deactivate U
    activate S
    S-->>U: Success
    activate U
    deactivate S
    U->>U: Redirect to new view URL with the DOI
    deactivate U
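
To make the ordering of the proposed steps concrete, here is a rough sketch. Everything in it is hypothetical: the `PublishSteps` type, its function names, and the `newMapPid` parameter are placeholders rather than MetacatUI or DataONE APIs, and the calls are synchronous only to keep the example short (real requests would be asynchronous).

```typescript
// Hypothetical bundle of operations the publish flow depends on.
type PublishSteps = {
  generateDoi: () => string;                         // reserve a new DOI via /generate
  fetchDoc: (pid: string) => string;                 // download a document by PID
  updateEml: (eml: string, doi: string) => string;   // rewrite <distribution> URLs
  updateResourceMap: (map: string, oldPid: string, doi: string) => string;
  saveDoc: (pid: string, body: string) => void;      // save a document under a PID
};

// Runs the proposed steps in order and returns the DOI so the caller can
// redirect to the new view URL.
function publishWithDoi(emlPid: string, mapPid: string, newMapPid: string, steps: PublishSteps): string {
  const doi = steps.generateDoi();
  const eml = steps.fetchDoc(emlPid);
  const map = steps.fetchDoc(mapPid);
  const newEml = steps.updateEml(eml, doi);
  const newMap = steps.updateResourceMap(map, emlPid, doi);
  steps.saveDoc(doi, newEml);       // the EML is saved with the new DOI
  steps.saveDoc(newMapPid, newMap); // the resource map is saved with a new PID
  return doi;
}
```

Passing the operations in as a parameter keeps the ordering testable without a server; it covers only the steps listed above.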

robyngit avatar Aug 17 '23 22:08 robyngit

Overall looks great @robyngit. The other thing the /publish endpoint does is change access control to make the whole package, including all metadata/ore/datafiles, publicly readable if they are not already. This is because we have a policy that data with a DOI are public. Can you add that to your list, and review the /publish implementation to be sure we're not missing something else?

mbjones avatar Aug 17 '23 23:08 mbjones

The publish method is implemented by Metacat in MNNodeService. Given an identifier and a session, it takes the following steps:

  1. Resolve SID to PID: Using the method getPIDForSID, it checks whether the original ID is actually a Series ID and resolves it to a PID if necessary.

  2. Fetch Metadata: Retrieves the system metadata (and Science Metadata?) of the dataset using the getSystemMetadata method.

  3. Mint New Identifier: Generates a new identifier (DOI) for the new version of the dataset using the generateIdentifier method.

  4. Update Metadata: Modifies the new System Metadata to reference the new identifier and to mark the original identifier as obsoleted.

  5. Make Metadata Public: If the original metadata isn't publicly accessible, the makePublicIfNot method ensures that the new metadata is made publicly readable.

  6. Update or Edit Metadata: If the original dataset is a science metadata document (e.g., in EML format), it updates the metadata with the new identifier.

  7. Object Update: Finally, it calls the update method to persist these changes.

  8. Update Resource Map: (optionally?) Updates the resource map (ORE) that describes the relationships between the metadata and any accompanying data. It does so either by finding an existing resource map and updating it or by generating a new one if an existing one is not found. Specifically, it:

    • Finds existing resource map: First, the code attempts to find the existing resource map based on a specific naming convention (potentialOreIdentifier). If it doesn't find it, it tries to get the newest resource map for the original identifier (originalIdentifier) from SOLR.
    • Modifies resource map: The existing resource map is modified using ResourceMapModifier to replace the identifier of the original metadata with the new DOI.
    • Prepares new resource map System Metadata: The System Metadata (SystemMetadata) of the existing resource map is copied, and some of its properties are updated, such as setting new identifiers, checksums, and size.
    • Makes resource map public: The new resource map System Metadata is made publicly readable.
    • Updates or creates resource map: Finally, the code updates the existing resource map with the modified one or, in some scenarios, creates a new resource map if one does not exist.
  9. Return New Identifier: The method returns the new identifier (DOI) that was minted for the updated science metadata.
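
Steps 4 and 5 (reference the new identifier, mark the original as obsoleted, and make the new version public) could look roughly like this on the client. This is a sketch over a simplified, hypothetical `SysMetaSketch` shape, not Metacat's Java code or MetacatUI's models; the `public`/`read` rule is an assumption about how public access is expressed.

```typescript
// Simplified, hypothetical view of system metadata for illustration.
type AccessRule = { subject: string; permission: string };
type SysMetaSketch = {
  identifier: string;
  obsoletes: string | null;
  accessRules: AccessRule[];
};

// Returns a copy of the rules that is guaranteed to include public read access.
function withPublicRead(rules: AccessRule[]): AccessRule[] {
  const alreadyPublic = rules.some(function (r) {
    return r.subject === "public" && r.permission === "read";
  });
  return alreadyPublic ? rules.slice() : rules.concat([{ subject: "public", permission: "read" }]);
}

// Builds system metadata for the new version without mutating the original:
// the new record carries the DOI and points back at the identifier it obsoletes.
function prepareNewVersion(original: SysMetaSketch, doi: string): SysMetaSketch {
  return {
    identifier: doi,
    obsoletes: original.identifier,
    accessRules: withPublicRead(original.accessRules),
  };
}
```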

Notes:

  • If an exception is caught, it gets wrapped in a ServiceFailure exception with an error code of 1030.

Questions

@taojing2002 - Is this summary correct, and am I missing anything here?

@mbjones and everybody - We decided that MetacatUI should implement the above steps rather than using the publish endpoint, in order to support automatically adding the online distribution URL to EML docs. Given the complexity of the task, I'm wondering whether it might make more sense to update the Metacat implementation instead? The publish method already parses and updates the EML with the editScienceMetadata method. From what I can tell, all it would require is to extend the method to also update old distribution URLs. I imagine that this would be a useful function for Metacat to perform generally? Thoughts?

robyngit avatar Aug 22 '23 22:08 robyngit

Great summary, thanks @robyngit. The reason I have been pushing you to reimplement in MetacatUI is that we've generally had the policy that Metacat never changes content -- it always just takes instructions from API clients on what to change. The publish method violated that principle, and it still leaves much to be desired. While it updates some fields within the EML, it does not properly update all of the areas of EML that should be updated. Plus, it doesn't support other metadata standards (like ISO), and so it doesn't conform to the metadata-standard-agnostic API we've had with Metacat. Also, if we release new versions of EML, the publish method would need to be updated to reflect those versions. I'd really like to keep the principle that changing metadata is a client responsibility, and validating metadata and storing it is a Metacat responsibility. But let's discuss -- expediency got us to where we are today, and sometimes we need to take the faster route.

mbjones avatar Aug 22 '23 23:08 mbjones

Oh, and to add one more thing. Because MetacatUI already has the metadata parsed and methods for creating and publishing new versions, I feel like the client-side implementation of this would be modifying a few things on a well-worn trail:

  • parse and load metadata (existing code and process)
  • generate a new identifier (DOI) (new code)
  • add identifier and distribution URL to the metadata model (existing code I think)
  • update access rules (existing code I think)
  • update ORE (existing code I think)
  • serialize and send to metacat (existing code I think)

Maybe I'm wrong about whether existing MetacatUI code already does all of this. So let's discuss if that is the case.

mbjones avatar Aug 22 '23 23:08 mbjones

The summary looks great!

taojing2002 avatar Aug 22 '23 23:08 taojing2002

Thank you for explaining that, @mbjones! I can see the rationale behind keeping the roles of storage/validation and metadata manipulation separate. I agree it makes sense to implement the new publish behaviour in MetacatUI.

Currently in MetacatUI, some of the code that handles data package management that we need for the publish behaviour is entangled in the EML211EditorView. There's also some model logic in the MetadataView. The new behaviour that we've outlined here belongs in a model, not added on the MetadataView. To proceed, I would like to move code out of these two views and into either the existing DataPackage collection, or into a new model (DataPackageManager?). The new publish method could be added to this model and used by the MetadataView's publish button.

This change is not strictly necessary for the new publish behaviour, but I think it would be a good idea for a few reasons:

  1. It will make the logic more accessible to other views that need it, and remove some redundancy in MetacatUI.
  2. We'll have contained functionality that's easier to understand, maintain, and most importantly: test. (maybe this will even help us to track down the dreaded #1586)
  3. Aligns with the MVC pattern.

Concerns

  • The DataPackage model itself is already quite large and complex (with parts that perhaps should be refactored into separate models).
  • The changes will likely conflict with Rushi's work on the hierarchical package table. @rushirajnenuji - have you made extensive changes to the DataPackage collection and MetadataView? If so, I think it would be best to proceed with this feature after the package table work is merged into develop.

robyngit avatar Aug 23 '23 19:08 robyngit

Hi @robyngit,

Thank you for checking.

Regarding changes related to the hierarchical package table work, the DataPackage collection does not have major changes. It includes some additional methods for parsing and storing atLocation information and nested package info.

However, I believe there are quite a few changes with the MetadataView. A lot of functionality related to the Package Table has been refactored and/or moved to other views, such as DataPackageView, DataItemView, etc.

rushirajnenuji avatar Aug 23 '23 20:08 rushirajnenuji

That makes sense @rushirajnenuji, thanks! I'll put this issue on hold for now, and continue later when this is merged in develop.

robyngit avatar Aug 23 '23 21:08 robyngit

Some quick thoughts on lifecycle representation in the app for discussion...

stateDiagram-v2
    [*] --> Draft: New
    Draft --> Draft: Save
    Draft --> [*]: Delete
    review : In review
    rev_request: Review requested
    Draft --> rev_request: Review
    rev_request --> review: StartReview
    review --> Draft: RequestRevision
    review --> Approved: Approve
    Approved --> Published: Publish
    Published --> Draft: Edit
    Published --> [*]
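
For discussion, the diagram can also be read as a transition table. The sketch below is hypothetical (the state names `ReviewRequested`, `InReview`, and `End`, and the `nextState` function, are placeholders, not existing MetacatUI code). The unlabeled Published --> [*] edge is left out because it has no action name.

```typescript
// Lifecycle states from the diagram; "End" stands in for the terminal [*] node.
type LifecycleState = "Draft" | "ReviewRequested" | "InReview" | "Approved" | "Published" | "End";

// Allowed actions per state, and the state each action leads to.
const lifecycleTransitions: { [state: string]: { [action: string]: LifecycleState } } = {
  Draft: { Save: "Draft", Delete: "End", Review: "ReviewRequested" },
  ReviewRequested: { StartReview: "InReview" },
  InReview: { RequestRevision: "Draft", Approve: "Approved" },
  Approved: { Publish: "Published" },
  Published: { Edit: "Draft" },
  End: {},
};

// Returns the next state, or null when the action is not allowed from `current`.
function nextState(current: LifecycleState, action: string): LifecycleState | null {
  const row = lifecycleTransitions[current];
  return Object.prototype.hasOwnProperty.call(row, action) ? row[action] : null;
}
```

A table like this makes it easy to check, for example, that Publish is only reachable from Approved.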

mbjones avatar Oct 17 '23 21:10 mbjones

@mbjones, thanks for this diagram! I opened #2205 as a place to continue the discussion on the publishing workflow, and included your comment there as well. This way, we can dive deeper into the workflow discussion without losing track of the specifics here.

Let's keep this particular issue focused on: 1) Automatically adding the distribution URL & 2) Moving the functionality that's currently in the MNNodeService publish method to MetacatUI, in support of the first point.

robyngit avatar Oct 18 '23 13:10 robyngit