On download, ensure resolve() endpoints exhaust all object locations

Open csjx opened this issue 7 years ago • 15 comments

We enable direct object downloads using the Download button in the MetadataView, and the URLs in the dataone theme default to CN resolve() URL endpoints. This call returns a response body with a Types.ObjectLocationList payload, and an HTTP 303 redirection code to the first URL in the object location list. If there is any failure on the redirected request, we need to catch the exception and iterate through the rest of the ObjectLocationList to try each replica URL until we succeed with the download. Only after exhausting the list should we show any sort of error message. Currently, we don't catch the error to provide a message - the browser just shows an error in the download status.

We should also think about providing a configuration flag in all themes to request to use the CN resolve URL by default, or to just go straight to the MN get() endpoint.
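A minimal sketch of what such a flag could look like. The option name `useResolveForDownloads` and the shape of the config object are hypothetical, not existing MetacatUI settings; the URL patterns follow the resolve and object URLs quoted elsewhere in this issue.

```javascript
// Hypothetical theme configuration; these property names are assumptions
// made for illustration, not existing MetacatUI options.
const exampleConfig = {
  useResolveForDownloads: true,
  cnBaseUrl: "https://cn.dataone.org/cn",
  mnBaseUrl: "https://example.org/metacat/d1/mn",
};

// Build the download URL for an object, using either the CN resolve()
// endpoint or the MN get() endpoint depending on the flag.
function buildDownloadUrl(pid, config) {
  const encodedPid = encodeURIComponent(pid);
  if (config.useResolveForDownloads) {
    return `${config.cnBaseUrl}/v2/resolve/${encodedPid}`;
  }
  return `${config.mnBaseUrl}/v2/object/${encodedPid}`;
}
```

A theme that sets the flag to `false` would skip the CN entirely, so it would also lose any replica fallback the CN might offer.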

csjx avatar Dec 11 '17 18:12 csjx

Moving comments from https://github.com/NCEAS/metacat/1407

We've discussed the issue where a repository in the down state, or just inaccessible because of network issues, will not be able to deliver objects with MN.get(). For people downloading from search.dataone.org, they can view the metadata (since it's replicated to the CN), but the data objects are unavailable, even when they are replicated to multiple member nodes.

@laurenwalker pointed out that XHR calls to CN.resolve() to get the ObjectLocationList for each object result in a 30x redirect, and the browser automatically follows the redirect to the (at times unavailable) first replica in the list. The ordering of the replicas on the CN side is being addressed in https://redmine.dataone.org/issues/8853 to de-prioritize replicas that are known to be unavailable (INVALID, down node status, etc). But MetacatUI needs to be able to iterate through the ObjectLocationList nonetheless and try/catch failures until success or the list is exhausted.

From this StackOverflow question we see that the browsers are doing what they are supposed to with XHR calls - follow the redirect.

However, we might be able to use the new Fetch API when calling CN.resolve() and set the Request.redirect property to manual so we can obtain the replica list and loop through it. Of course, this depends on browser support. Other ideas welcome!

I did some quick testing in a JSFiddle, and it turns out that fetch(url, { redirect: "manual" }) does not expose the body of the original response (it is set to null). The same is true of { redirect: "error" }, which just fails with a network error if the response code is 3xx.
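The test can be reproduced with something like the sketch below. The function names are made up for illustration. The behaviour described in the comments is what the Fetch standard specifies for browsers; other runtimes may treat `redirect: "manual"` differently.

```javascript
// Reduce a Response to the fields a script can actually inspect.
function summarizeResponse(response) {
  return {
    type: response.type,
    status: response.status,
    hasBody: response.body !== null,
    location: response.headers.get("Location"),
  };
}

// Call resolve() without following the redirect. In a browser, a 3xx answer
// comes back as an "opaqueredirect" response: status 0, no readable headers,
// and a null body, so the ObjectLocationList is not reachable this way.
async function probeResolve(resolveUrl, fetchImpl = fetch) {
  const response = await fetchImpl(resolveUrl, { redirect: "manual" });
  return summarizeResponse(response);
}
```

In a browser console, `probeResolve(url).then(console.log)` against a resolve URL should log a summary with `hasBody: false` and `location: null`.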

Ultimately, there's no way to access the CN.resolve() body in the browser. From discussion here and here, the WHATWG Fetch API specifically disallows access for cross-site scripting security reasons. I would imagine the browser disallows it for XHR requests for the same reason.

The only two ways forward I see are

  1. Changing the CN response code from 303 to 200 and requiring the client to iterate through the ObjectLocationList body (unlikely, given this is a full CI stack change - @datadavev @mbjones can comment)

  2. MetacatUI tries to follow the redirect, and catches any errors (404, timeout, etc.) and then creates its own replica list from the SystemMetadata.replica list, which would involve discovering the baseUrl of each replica node from the node list given a nodeId, and tries to fetch the object from each replica in the list until it succeeds. Since the replica list should include status of each replica, we could hit only COMPLETED replicas (skipping those with FAILED|QUEUED|REQUESTED|INVALIDATED).
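Option 2 could be sketched roughly as follows. The replica objects (`nodeId`, `status`) and the `nodeBaseUrls` map are assumed shapes for data already parsed from the system metadata and the node list; they are not existing MetacatUI models. The status comparison is case-insensitive so it does not depend on how the value is serialized.

```javascript
// Node identifier -> base URL, as it might be built from the CN node list.
const nodeBaseUrls = new Map([
  ["urn:node:PNDB", "https://pndb.fr/metacat/d1/mn"],
  ["urn:node:CN", "https://cn.dataone.org/cn"],
]);

// Return object URLs for replicas whose status is "completed", skipping
// replicas in any other state and nodes missing from the node list.
function candidateReplicaUrls(pid, replicas, baseUrls) {
  return replicas
    .filter((replica) => replica.status.toLowerCase() === "completed")
    .map((replica) => baseUrls.get(replica.nodeId))
    .filter((baseUrl) => typeof baseUrl === "string")
    .map((baseUrl) => `${baseUrl}/v2/object/${encodeURIComponent(pid)}`);
}
```

The returned list keeps the order of the replica entries, so any prioritization would have to happen before or after this step.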

csjx avatar Nov 25 '19 23:11 csjx

Thanks for bringing the details in here. (2) doesn't sound too bad. Other ideas might be:

  3. Have MetacatUI set either the Accept (or another header, like X-PLZ-NO-REDIRECT: true) to override the redirect behavior at the servlet level. For Accept, the value could be set to either (text|application)/xml or even a string representing the http://ns.dataone.org/service/types/v1#objectLocationList to make it even more clear.
  4. Have the servlet pre-flight the redirect before responding.

amoeba avatar Nov 26 '19 00:11 amoeba

Yeah @amoeba - I like the negotiated Accept type idea, so the default behavior would be 303, and optionally a 200 if the ObjectLocationList is desired and the Accept header is whatever we decide is a good one. Both (3) and (4) still involve a CN change, but as in (1), I suppose those are just d1_cn_rest changes, and not type or library changes. Unless we hear from @datadavev or @mbjones regarding cycles to put towards this on the CN side, I think we would have to go with (2) at a minimum.
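To make the negotiated behaviour concrete, here is a sketch of the decision rule only, written in JavaScript for readability. It is not d1_cn_rest code, and the function name is made up. The `text/html` guard is an added assumption: browsers commonly list `application/xml` in the Accept header of ordinary page navigations, so a plain link click should probably still get the 303.

```javascript
// Media types that would signal "send me the ObjectLocationList".
const LOCATION_LIST_TYPES = ["text/xml", "application/xml"];

// Decide which status resolve() would send for the given request headers.
// Header names are matched case-insensitively.
function resolveStatusFor(requestHeaders) {
  const headers = new Map(
    Object.entries(requestHeaders).map(([name, value]) => [
      name.toLowerCase(),
      String(value).toLowerCase(),
    ]),
  );
  // Explicit opt-out header, as suggested in this thread.
  if (headers.get("x-plz-no-redirect") === "true") return 200;

  // Strip parameters such as ";q=0.9" and compare bare media types.
  const accepted = (headers.get("accept") || "")
    .split(",")
    .map((part) => part.split(";")[0].trim());
  const wantsList = accepted.some((type) => LOCATION_LIST_TYPES.includes(type));
  const wantsHtml = accepted.includes("text/html");
  return wantsList && !wantsHtml ? 200 : 303;
}
```

With this rule, a request with no Accept header keeps today's 303, which is the default behaviour described above.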

csjx avatar Nov 27 '19 00:11 csjx

More comments and ideas from a Slack thread:

lauren 1:44 PM The download via CORS issue is fixed… but no, the DataONE CN resolve service doesn’t seem to work like I would expect it to, since even though the connection to the replica server timed out, the resolve service doesn’t try to fetch the object from a different replica server
1:44 Maybe the other devs here know why that is
1:45 or how that could be fixed

rossdm 1:45 PM ok, strange...it looks like the download links described above are working now
1:46 will be a problem if one replica going offline affects the overall package, kind of the opposite desired effect....will let our users know that the issue seems to be(?) resolved
1:46 ty for letting me know

chris: 1:48 PM @lauren - Since the resolve service is just a pass-through, this really is a client issue. I think in the short term our only fix is #2 in https://github.com/NCEAS/metacatui/issues/415#issuecomment-558388040 . When we have resources to make changes on the CN side, we could pursue that too.

lauren 1:48 PM I’m just not sure how we are supposed to tell if a download has failed in the browser
1:48 I guess I’ll look into it
1:48 It seems like a lot of work for the client to do

chris: 1:49 PM I think it’s a matter of a try/catch - so if you get a network error or a 404 or the like, you move on to the next replica in the list.
1:50 the error callback in the XHR would need to handle it I think

lauren 1:50 PM That makes sense for XHR downloads, but we just fixed the download functionality so that it only sends XHR when the object is private
1:51 Otherwise, we just make an HTML link with the download attribute and just let the browser handle it

chris: 1:51 PM Ah right. That definitely complicates it 🙂

lauren 1:52 PM I think the only solution would be to show multiple links to the user, for each replica. So the user can manually try another link

chris: 1:54 PM You could also do an XHR call to MNCore.describe(pid) (which is an HTTP HEAD /object/{pid} ) and if that succeeds, put in the link to MNCore.read()

davev 1:54 PM link status should really be tracked by the CNs to prioritize resolve, though that adds a big load to CN work
1:56 It would be great if the CN could receive notice of failed links from clients. Something like a client does the head requests as Chris described and pings the CN with the info
1:57 Kind of like pushing the work onto the clients, but exposing the results of that work to future clients.

chris: 1:57 PM Well, the replica auditor does that to a certain degree (checking fixity and availability), but it also needs some work
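The HEAD-request idea from the thread above could look roughly like this. The function name is hypothetical, and the sketch assumes each member node allows cross-origin HEAD requests to `/object/{pid}`; authentication for private objects is left out.

```javascript
// Return the first URL whose HEAD request succeeds, or null if none do.
async function firstReachableUrl(urls, fetchImpl = fetch) {
  for (const url of urls) {
    try {
      const response = await fetchImpl(url, { method: "HEAD" });
      if (response.ok) return url;
      // Non-2xx status (for example 404): fall through to the next replica.
    } catch (error) {
      // Network failure or CORS rejection: try the next replica.
    }
  }
  return null;
}
```

The returned URL could then be used as the `href` of the download link, and a `null` result would be the point at which to show an error message. A real implementation would probably also pass an abort signal so that an unresponsive node cannot stall the loop.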

csjx avatar Feb 18 '20 21:02 csjx

Bump. @laurenwalker - In the ESS-DIVE discussion today, this bug came up as a priority. Can we add this to an upcoming milestone?

csjx avatar Apr 13 '20 21:04 csjx

Ok, I added it to 2.12.0

laurenwalker avatar Apr 14 '20 16:04 laurenwalker

This issue came up again today. A user tried to access a PNDB-hosted resource map through DataONE, but it failed because of the CORS config on PNDB. The resource is actually available through the CN, and the CN object url is listed in the objectLocationList. See:

curl -s -H "Accept: text/xml" "https://cn.dataone.org/cn/v2/resolve/resource_map_urn%3Auuid%3A2d9baf2c-62c8-41b2-9178-dd68af3b3379" --->

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:objectLocationList xmlns:ns2="http://ns.dataone.org/service/types/v1">
  <identifier>resource_map_urn:uuid:2d9baf2c-62c8-41b2-9178-dd68af3b3379</identifier>
  <objectLocation>
    <nodeIdentifier>urn:node:PNDB</nodeIdentifier>
    <baseURL>https://pndb.fr/metacat/d1/mn</baseURL>
    <version>v1</version>
    <version>v2</version>
    <url>https://pndb.fr/metacat/d1/mn/v2/object/resource_map_urn:uuid:2d9baf2c-62c8-41b2-9178-dd68af3b3379</url>
  </objectLocation>
  <objectLocation>
    <nodeIdentifier>urn:node:CN</nodeIdentifier>
    <baseURL>https://cn.dataone.org/cn</baseURL>
    <version>v1</version>
    <version>v2</version>
    <url>https://cn.dataone.org/cn/v2/object/resource_map_urn:uuid:2d9baf2c-62c8-41b2-9178-dd68af3b3379</url>
  </objectLocation>
</ns2:objectLocationList>
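For a client that can read this body (as curl does here), pulling the locations out of the response is simple. The sketch below is a deliberately small string-based extractor that assumes a well-formed response shaped like the one above; it does not decode XML entities, and in an environment with a real XML parser that parser would be the better choice. The function names are made up.

```javascript
// Extract the text of the first <tag>...</tag> inside an XML fragment.
function firstTagText(fragment, tag) {
  const match = fragment.match(new RegExp(`<${tag}>([^<]*)</${tag}>`));
  return match ? match[1].trim() : null;
}

// Turn an objectLocationList document into a list of plain objects,
// one per <objectLocation> element, in document order.
function parseObjectLocations(xml) {
  const blocks = xml.match(/<objectLocation>[\s\S]*?<\/objectLocation>/g) || [];
  return blocks.map((block) => ({
    nodeId: firstTagText(block, "nodeIdentifier"),
    baseUrl: firstTagText(block, "baseURL"),
    url: firstTagText(block, "url"),
  }));
}
```

For the response above this yields two entries, the PNDB location first and the CN location second.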

As pointed out earlier in this issue and confirmed again today with testing, there is unfortunately no way to access the objectLocationList in the browser. Without making changes to the CN, the browser-based solution is still the one proposed in 2019:

MetacatUI tries to follow the redirect, and catches any errors (404, timeout, etc.) and then creates its own replica list from the SystemMetadata.replica list, which would involve discovering the baseUrl of each replica node from the node list given a nodeId, and tries to fetch the object from each replica in the list until it succeeds. Since the replica list should include status of each replica, we could hit only COMPLETED replicas (skipping those with FAILED|QUEUED|REQUESTED|INVALIDATED).

The solution makes sense for requesting resource maps, but still makes getting the URL for the download buttons complicated, as Lauren pointed out: the download functionality only sends XHR when the object is private. Otherwise, we just make an HTML link and let the browser handle it.

robyngit avatar Feb 05 '25 16:02 robyngit

Here is a summary of the suggested server-side solutions to this problem, in case the client-side solutions are not as feasible as we hope:

1. CN returns 200 with the ObjectLocationList instead of a 303

  • CN never returns a 303, but instead a 200 status along with the ObjectLocationList
  • Major change since it modifies default DataONE API behaviour.

2. CN accepts custom header to skip 303 redirects

  • CN continues to return 303 by default, but if the request sets a particular Accept header (or a custom header like X-PLZ-NO-REDIRECT: true), the CN returns a 200 status along with the ObjectLocationList.
  • Gives clients access to ObjectLocationList, but doesn't change default API behaviour

3. CN checks replica availability and re-prioritizes accordingly

  • CN checks replica availability itself (either live or via the "DataONE replica auditor") and prioritizes only those replicas known to be up.
  • Reduces but does not eliminate chance of the first location being unavailable
  • High workload for the CN

4. CN accepts feedback from client and re-prioritizes accordingly

  • Client reports to the CN when a given replica link fails. The CN uses that info to de-prioritize/flag replica.

robyngit avatar Feb 05 '25 17:02 robyngit

Here's a couple options:

  1. Implement an optional header like X-Follow-Redirect=False on the CN side so that a browser client can issue a resolve request without being redirected. It may be possible to implement this in Apache config rather than messing with the CN implementation.
  2. Implement a resolver lookup service that returns the resolver response but changes the status code to 200
  3. Get the system metadata for the object and inspect the replica elements, grab the replicaMemberNode values and look up the target in the node list.
  4. ~~Query the index for object locations, dataUrl should work, but it doesn't list all the targets.~~

others?

datadavev avatar Feb 05 '25 17:02 datadavev

  5. On the server side, use content negotiation - return 200 if text/xml is requested, otherwise (or maybe only if text/html is requested) we return 303 -- I actually thought this was what we had specified originally but reading the CN resolve docs I see we say it should always be 303. So that seems to not be the case. If a client requests XML or JSON, for example, they are unlikely to want a redirect here.

mbjones avatar Feb 05 '25 17:02 mbjones

for 5. - need to verify none of the existing MN or CN code relies on resolve response in text/xml.

Might be better to have a custom content type, e.g. application/vnd+dataone+json or some such.

datadavev avatar Feb 05 '25 17:02 datadavev

wrt responding to a custom header, this is where response status code is being set in d1_cn_rest: https://github.com/DataONEorg/d1_cn_rest/blob/0609c11d66605a60a297a2264ae58120c11eb371/src/main/java/org/dataone/cn/rest/v2/ResolveFilter.java#L307

datadavev avatar Feb 06 '25 16:02 datadavev

Option 6, similar to what Chris suggested earlier:

  • MetacatUI detects failure after redirect fails
  • Then queries SOLR for list of replica nodes (replicaMN) - example query
  • Then reconstructs the URL using the node list
  • Then goes down the list and tries each URL until one is successful...
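The Solr lookup in the second step might be built like this. The CN query path and the `datasource` field are assumptions about the DataONE index rather than details taken from this thread, and the function name is made up.

```javascript
// Build a CN Solr query URL that asks which nodes hold replicas of `pid`.
function replicaQueryUrl(pid, cnBaseUrl = "https://cn.dataone.org/cn") {
  // Quote the identifier so colons and slashes are not parsed as Solr syntax,
  // escaping any quotes or backslashes it contains.
  const quotedPid = `"${pid.replace(/(["\\])/g, "\\$1")}"`;
  const params = new URLSearchParams({
    q: `id:${quotedPid}`,
    fl: "id,datasource,replicaMN",
    wt: "json",
  });
  return `${cnBaseUrl}/v2/query/solr/?${params.toString()}`;
}
```

Each node identifier in the `replicaMN` values of the result would then be mapped to a base URL through the node list before trying the object URLs in turn.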

robyngit avatar Mar 03 '25 21:03 robyngit

Option 7 (?)

  • Solr has the replica field which lists the nodes that hold replicas. Could we add a new Solr field that gives the actual complete URLs for the objects?
  • Then the process of finding the available replica would be a bit more straightforward for MetacatUI:
    • MetacatUI detects failure after redirect fails
    • Then queries SOLR for replica URLs
    • Then goes down the list and tries each URL until one is successful

robyngit avatar May 01 '25 16:05 robyngit

It is hard to index the actual URL, because it can change at any time, which would then mean reindexing all objects for a node and knowing when to trigger that reindex. The node identifier is always consistent, and can easily be used to construct the current URL for a replica object. You probably already have the nodelist cached, and it only needs to be refreshed periodically.

  • for each replica node, lookup ${nodeid} --> ${baseurl}
  • URI is ${baseurl}/{v1,v2}/object/${pid}

Eventually it would be nice to mark nodelist entries as up or down more dynamically, which would also give the client more info to decide whether to even try a given replica. In that case, we'd want to get the node metadata just before trying to resolve objects to get the current state of affairs.
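A sketch of the cached lookup described above. The class and its loader callback are hypothetical, and the loader is synchronous only to keep the example short; a real one would fetch and parse the CN node list asynchronously. The one-hour default refresh interval is an arbitrary placeholder.

```javascript
// Caches a nodeId -> baseUrl map and reloads it once it is older than maxAgeMs.
class NodeListCache {
  // loadNodes: function returning a Map of node identifier -> base URL.
  // now: clock function, injectable so the refresh logic can be exercised.
  constructor(loadNodes, maxAgeMs = 60 * 60 * 1000, now = Date.now) {
    this.loadNodes = loadNodes;
    this.maxAgeMs = maxAgeMs;
    this.now = now;
    this.nodes = null;
    this.loadedAt = 0;
  }

  // Look up ${nodeid} --> ${baseurl}, refreshing the node list when stale.
  baseUrlFor(nodeId) {
    const stale = this.nodes === null || this.now() - this.loadedAt > this.maxAgeMs;
    if (stale) {
      this.nodes = this.loadNodes();
      this.loadedAt = this.now();
    }
    return this.nodes.get(nodeId) ?? null;
  }

  // Build ${baseurl}/{v1,v2}/object/${pid}, or null for an unknown node.
  replicaUrl(nodeId, pid, apiVersion = "v2") {
    const baseUrl = this.baseUrlFor(nodeId);
    if (baseUrl === null) return null;
    return `${baseUrl}/${apiVersion}/object/${encodeURIComponent(pid)}`;
  }
}
```

If node state became more dynamic, as suggested above, shortening `maxAgeMs` or forcing a reload just before resolving would be the natural place to pick that up.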

mbjones avatar May 02 '25 00:05 mbjones