New Feature: Metadata Replication from EDI to ADC

Open dvirlar2 opened this issue 2 years ago • 7 comments

Original Request: For @taojing2002 to update the authNode policy on metadata records cloned from EDI, so that he could update the system metadata slots obsoletes and obsoletedBy to maintain the dataset's version chain.

Dataset Context: We have version 2.4 of the "Dissolved organic carbon (DOC) and total dissolved nitrogen (TDN) from river, lagoon, and open ocean sites along the Alaska Beaufort Sea coast, 2018-ongoing", which was originally published on EDI here. An Nguyen and Tim Whittaker of the Beaufort Lagoon Ecosystems LTER site have requested that we update our version of this dataset (and many others) to be in sync with the latest version on EDI's site, which is version 2.6. These requests occurred in Ticket 25955 and Ticket 26032.

Upon cloning version 2.5 and version 2.6 of the dataset from EDI, we realized the obsoletes and obsoletedBy fields were not identical to the original versions (they were set to NA by the datamgmt::clone_package() function). This resulted in all three versions (2.4, 2.5, and 2.6) being available in ADC's catalog, rather than only the newest version.
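
A minimal sketch of the version chain that the obsoletes and obsoletedBy fields are meant to express, assuming the versions are known in oldest-to-newest order. The identifiers and the `build_version_chain` helper are hypothetical, used only for illustration; this is not how datamgmt::clone_package() or Metacat sets these fields.

```python
from typing import Dict, List, Optional


def build_version_chain(pids_oldest_first: List[str]) -> Dict[str, Dict[str, Optional[str]]]:
    """Compute the obsoletes/obsoletedBy value each version should carry."""
    chain: Dict[str, Dict[str, Optional[str]]] = {}
    last = len(pids_oldest_first) - 1
    for i, pid in enumerate(pids_oldest_first):
        chain[pid] = {
            # The previous version, if any, is the one this version replaces.
            "obsoletes": pids_oldest_first[i - 1] if i > 0 else None,
            # The next version, if any, is the one that replaces this version.
            "obsoletedBy": pids_oldest_first[i + 1] if i < last else None,
        }
    return chain


# Hypothetical identifiers standing in for versions 2.4, 2.5 and 2.6.
chain = build_version_chain(["doc-tdn.2.4", "doc-tdn.2.5", "doc-tdn.2.6"])

# Versions with no obsoletedBy value are treated as current.
current = [pid for pid, links in chain.items() if links["obsoletedBy"] is None]
```

With an intact chain only the newest version is current. When every obsoletedBy value is missing, as happened with the cloned records, all three versions look current.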

Moving Forward: Seeing as how there are seven other datasets we need to update for these tickets, we would like to automate the replication process of updated datasets from EDI to the ADC.

dvirlar2 avatar Apr 06 '23 22:04 dvirlar2

More comments to come from @mbjones

dvirlar2 avatar Apr 07 '23 18:04 dvirlar2

@taojing2002 -- what I'd like to do here is to change our allowedNode list (see https://dataoneorg.github.io/api-documentation/apis/Types.html#Types.NodeReplicationPolicy.allowedNode) to include urn:node:EDI (or maybe it is LTER, not sure for this dataset), such that when EDI marks a dataset to be replicated with a preferredNode set to urn:node:ARCTIC, we will get a copy automatically without further manual intervention. For this to work, I think we need @twhiteaker to set the replicationPolicy for the datasets they would like to be replicated to ADC. I'm not sure how that would work for him, and I'm not sure if he has a mechanism to request such a specific replicationPolicy for data he sends to EDI, but let's discuss it with him. As some of those datasets are already partially replicated to the ADC, we may need to clear some of the versions from our repo so that the new copies can make their way to us. Let's discuss if this is feasible, as it's an ongoing issue for the datateam and for a couple of LTER sites.

mbjones avatar Apr 07 '23 20:04 mbjones

Following an example that @servilla gave me for ARC LTER, would this work for setting the replication policy in EML? Would this make it so that I no longer have to notify ADC when we've published a dataset or revision in EDI?

<additionalMetadata>
  <metadata>
    <d1v1:replicationPolicy xmlns:d1v1="http://ns.dataone.org/service/types/v1" numberReplicas="1" replicationAllowed="true">
      <preferredMemberNode>urn:node:ADC</preferredMemberNode>
    </d1v1:replicationPolicy>
  </metadata>
</additionalMetadata>

twhiteaker avatar Apr 10 '23 15:04 twhiteaker

That might work (although the node identifier for the Arctic Data Center is urn:node:ARCTIC). If EDI is watching that EML additionalMetadata field for the replication policy and then adding that to the system metadata for each object that they publish to DataONE, then that would work and be aligned with what I was proposing. We have a bit of setup on the ADC end to get this all to work if EDI is already setting replication policies.
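
A minimal sketch of what watching the additionalMetadata field could look like, assuming the policy is embedded exactly as in the snippet above but with the corrected node identifier. The `extract_replication_policy` helper is hypothetical and is not EDI's actual implementation; it only shows how the policy could be read with Python's standard library before being copied into system metadata.

```python
import xml.etree.ElementTree as ET

# Namespace used by the d1v1 prefix in the snippet from this thread.
D1_NS = "http://ns.dataone.org/service/types/v1"


def extract_replication_policy(eml_xml: str):
    """Return the embedded replication policy as a dict, or None if absent."""
    root = ET.fromstring(eml_xml)
    # iter() also matches the root element, unlike a './/' path search.
    policy = next(root.iter(f"{{{D1_NS}}}replicationPolicy"), None)
    if policy is None:
        return None
    return {
        "replicationAllowed": policy.get("replicationAllowed") == "true",
        "numberReplicas": int(policy.get("numberReplicas", "0")),
        # The child elements carry no namespace in the snippet above.
        "preferredMemberNodes": [
            node.text.strip()
            for node in policy.findall("preferredMemberNode")
            if node.text
        ],
    }


sample = """
<additionalMetadata>
  <metadata>
    <d1v1:replicationPolicy xmlns:d1v1="http://ns.dataone.org/service/types/v1" numberReplicas="1" replicationAllowed="true">
      <preferredMemberNode>urn:node:ARCTIC</preferredMemberNode>
    </d1v1:replicationPolicy>
  </metadata>
</additionalMetadata>
"""

policy = extract_replication_policy(sample)
```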

mbjones avatar Apr 10 '23 17:04 mbjones

Hi Matt,

ARC LTER has been using this approach for a number of years now. I assume it is working since I have heard nothing to the contrary. And yes, we monitor "additionalMetadata" for this specific use case. Let me know if this is not working and we'll address it accordingly.

Sincerely, Mark



servilla avatar Apr 10 '23 17:04 servilla

@mbjones @servilla @twhiteaker The feature to support allowedNode was implemented in Metacat and tested in the cn sandbox environment a while ago. It seemed to work in the sandbox env. However, there was a glitch when we applied it to the replication between pisco and opc in the production environment. Replication on opc was turned on and set up to accept objects only from pisco, and a replication policy preferring opc was added to one object on pisco. If everything had gone well, the object from pisco would have been replicated to opc. However, it landed on another member node, which is not in the preference list but accepts replicas from any node.
Rani and I didn't dig into this issue. But my intuition is that, besides the Metacat implementation, the replication service on cn needs some modification. Since replication to opc is on, cn keeps sending objects to opc. However, opc rejects objects that are not from pisco. So many failures at opc make cn mark it as an unreliable node, and cn then stops sending objects to opc. Eventually the object was sent to another node. To make this feature work, we need cn to understand the allowedNode feature: sending only specific objects to this node rather than fanning out arbitrary objects to it.

Note: we only tested one object.
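
A minimal sketch of the change suggested above, assuming the cn can see each target node's allowedNode list before it picks a replication target. The function names and the urn:node:PISCO, urn:node:OPC and urn:node:OTHER identifiers are hypothetical, and this is not the cn's actual replication code; it only illustrates filtering targets up front instead of sending objects that the target will reject.

```python
from typing import Dict, List, Optional, Set


def accepts(allowed: Optional[Set[str]], source_node: str) -> bool:
    """A node with no allowedNode list accepts replicas from any source."""
    return allowed is None or source_node in allowed


def pick_target(
    source_node: str,
    preferred: List[str],
    candidates: List[str],
    allowed_by_node: Dict[str, Set[str]],
) -> Optional[str]:
    """Pick the first node that would accept the object, preferred nodes first."""
    ordered = [n for n in preferred if n in candidates]
    ordered += [n for n in candidates if n not in preferred]
    for node in ordered:
        # Skip nodes whose allowedNode list excludes this source, so the
        # target never has to reject the object and accumulate failures.
        if accepts(allowed_by_node.get(node), source_node):
            return node
    return None


# Hypothetical setup: OPC accepts replicas only from PISCO.
allowed = {"urn:node:OPC": {"urn:node:PISCO"}}
candidates = ["urn:node:OTHER", "urn:node:OPC"]

from_pisco = pick_target("urn:node:PISCO", ["urn:node:OPC"], candidates, allowed)
from_elsewhere = pick_target("urn:node:ELSEWHERE", [], candidates, allowed)
```

In this sketch the object from pisco goes to opc, while objects from other sources are never offered to opc, so opc is not marked unreliable for rejecting them.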

taojing2002 avatar Apr 10 '23 17:04 taojing2002

Thanks, Mark. Although we had discussed using that mechanism with ARC LTER, I think we're still manually replicating from them as well. But we'd love to change each of them to be automated, so we will start down that path.

@taojing2002 I was aware of the issues with preferredNode, but it should still work most of the time. But I agree it would be good to support a more explicit interpretation that requires a replica be made on a specific node. Let's discuss that again and see what it would take -- I think there's a reasonably fast approach we could take in which we try to replicate, let the CN do what it needs if it can't find the preferredNode, but then later come back and check those again to try to satisfy the replication policy request.
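
A minimal sketch of the "come back and check" pass described above, assuming each object's preferred nodes and current replica locations are available as plain lists. The `nodes_to_retry` helper and the identifiers are hypothetical; this is not an existing CN feature.

```python
from typing import Dict, List


def nodes_to_retry(objects: Dict[str, Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Map each pid to the preferred nodes that still lack a replica."""
    retry: Dict[str, List[str]] = {}
    for pid, info in objects.items():
        missing = [n for n in info["preferred"] if n not in info["replicas"]]
        if missing:
            retry[pid] = missing
    return retry


# Hypothetical state after a first replication pass.
pending = nodes_to_retry({
    # Landed on a fallback node, so the preferred node is still unmet.
    "pid-a": {"preferred": ["urn:node:ARCTIC"], "replicas": ["urn:node:OTHER"]},
    # Already on the preferred node, nothing to retry.
    "pid-b": {"preferred": ["urn:node:ARCTIC"], "replicas": ["urn:node:ARCTIC"]},
})
```

Running such a pass periodically would let the first attempt fall back to any available node while still converging on the requested replication policy later.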

mbjones avatar Apr 10 '23 21:04 mbjones