pyDataverse icon indicating copy to clipboard operation
pyDataverse copied to clipboard

Add Signed URL Support

Open shoeffner opened this issue 1 year ago • 9 comments

As discussed in #192, it could be a good idea to include Signed URL support as described in https://guides.dataverse.org/en/latest/api/external-tools.html#signed-urls.

This issue is there to discuss, track, etc. that proposal.

shoeffner avatar Jul 18 '24 23:07 shoeffner

If we want to help out upstream, a report came in about the "edit" API endpoint not working with signed URLs. Maybe we try to reproduce the error:

  • https://github.com/IQSS/dataverse/issues/10704

pdurbin avatar Jul 19 '24 21:07 pdurbin

Unless I have the wrong thing, it looks like this call requires PUT not POST: https://github.com/IQSS/dataverse/blob/52e969e1fc69be19356c955f31551d605a276b8e/src/main/java/edu/harvard/iq/dataverse/api/EditDDI.java#L92-L96

qqmyers avatar Jul 19 '24 21:07 qqmyers

Yep, I tried a PUT request and it worked! Thanks @qqmyers!

nana-boateng avatar Jul 22 '24 05:07 nana-boateng

While trying this out, I found out a couple of things:

  • demo.dataverse.org returns a 503 with the message: {status:"error", message:"Endpoint available from localhost only. Please contact the dataverse administrator"}, which should definitely be handled gracefully
  • With the localhost docker server, when sending a malformed JSON as a signing request – I added an extra comma before the closing brace, i.e., {"a": "b",} – the server returns 500 with no message ({} is the actual response), but the server logs state the error:
    dataverse         | [#|2024-07-25T00:50:23.835+0000|SEVERE|Payara 6.2024.6|edu.harvard.iq.dataverse.api.errorhandlers.ThrowableHandler|_ThreadID=91;_ThreadName=http-thread-pool::http-listener-1(1);_TimeMillis=1721868623835;_LevelValue=1000;|
    dataverse         |   _status="ERROR";_code=500;_message="Internal server error. More details available at the server logs.";_incidentId="e748f4e3-3aaa-46ca-8fbe-6e1b68968cb3";_interalError="JsonParsingException";_requestUrl="http://localhost:8080/api/v1/admin/requestSignedUrl";_requestMethod="POST"||#]
    dataverse         | 
    dataverse         | [#|2024-07-25T00:50:23.837+0000|SEVERE|Payara 6.2024.6|edu.harvard.iq.dataverse.api.errorhandlers.ThrowableHandler|_ThreadID=91;_ThreadName=http-thread-pool::http-listener-1(1);_TimeMillis=1721868623837;_LevelValue=1000;|
    dataverse         |   jakarta.json.stream.JsonParsingException: Invalid token=CURLYCLOSE at (line no=6, column no=1, offset=183). Expected tokens are: [STRING]
    dataverse         |     at org.eclipse.parsson.JsonParserImpl.parsingException(JsonParserImpl.java:455)
    
    (removed the stacktrace for brevity). This should probably throw a 400 instead and maybe even return the JsonParsingException output, as that is really helpful to understand what went wrong with the request. In either case, it should be handled gracefully and maybe reported upstream, if it wasn't already done.
  • I didn't find a documentation for the correct/expected response to a signing request; in general the whole signed URL mechanism seems very sparsely documented. For references, this is what a correct response seems to look like:
    {
        "status": "OK",
        "data": {
            "signedUrl": "..."
        }
    }
    
    with the signedUrl being of the form: {originalUrl}?until={isoformatted-UTC}&user={username}&method={http-verb}&token={a long hexadecimal token}. Note that the isoformatted timestamp has a T separating date and time, no timezone information (but is UTC), and ends in 1/1000 s (0000-00-00T00:00.000).
  • I don't know from the docs if the signed URLs work for everything, or if, e.g., the SWORD API, which has a different authentication mechanism, works with it; nor if both, API and Bearer tokens work for signed URLs. However, I think the httpx.Auth auth_flow should support all of that nonetheless with, e.g., yield from signing_request_auth_flow(...) or something.

So far I only wrote unit test cases and haven't done any actual implementation yet (so the tests might still change a little bit here and there, especially the test_signed_url_flow_tracks_data_payload), I also want to add an integration test against the localhost server to see that the flow actually works. https://github.com/shoeffner/pyDataverse/commit/6386001196880014e99f386c65a88b3a4325b512 The branch relies on #201 , so it might become rebased or modified depending on that branch.

shoeffner avatar Jul 25 '24 01:07 shoeffner

How is pyDataverse supposed to be using SignedURLs?

  1. To generate (and distribute) signed URLs for other users/applications to use?
  2. To request a signed URL per request (potentially caching them for repeated requests until they time out?)
  3. To use signed URLs supplied from another application?

Am I missing something?

So far, I focused on 2 (but this kind of entails 1), but I haven't made up any thoughts about 3, although that might also work, in particular if we implement "caching", then we could simply fill the cache with pre-signed URLs.

Another interesting thing I read is (emphasis mine):

url - the exact URL to sign, including api version number and all query parameters

Right now, pyDataverse mostly ignores the API version, so pyDataverse can either modify the URL before requesting a signature for it, or all routes should be changed to include the API Version (which relates to #197 and #189). Ironically, the canonical example uses /api/admin/requestSignedUrl without a version argument.

shoeffner avatar Jul 27 '24 09:07 shoeffner

To shed a little bit more light on my question on what should be supported, here's a little overview over the three variants.

Variant 1

This would be variant 1, and from what I understand, the "common" use case for signed URLs (at least from Dataverse's perspective):

sequenceDiagram
    participant User
    participant Admin/App
    participant pyDataverse
    participant Dataverse
    User->>Admin/App: Asks to perform action
    Admin/App->>pyDataverse: Requests signed URL
    pyDataverse->>Dataverse: Requests signed URL
    Dataverse->>pyDataverse: Issues signed URL
    pyDataverse->>Admin/App: Passes signed URL
    Admin/App->>User: Forwards signed URL
    User->>Dataverse: Uses signed URL to perform action
    Dataverse->>User: Responds with result

For this, pyDataverse would need to be able to returned signed URLs instead of actually performing the actions.

Variant 2

This is, as far as I understood, more or less the idea proposed with "we should support signed URLs" in #192.

sequenceDiagram
    participant User
    participant Admin/App
    participant pyDataverse
    participant Dataverse
    Admin/App->>pyDataverse: Asks to perform action
    pyDataverse->>Dataverse: Requests signed URL
    Dataverse->>pyDataverse: Issues signed URL
    pyDataverse->>Dataverse: Uses signed URL to perform action
    Dataverse->>pyDataverse: Responds with result
    pyDataverse->>Admin/App: Delivers result

Thinking more about signed URLs, I am not sure this is what is intended, as it eventually only increases the number of API calls.

Variant 3

In this scenario, the user wants to perform an action and some other App/Admin generates the signed URL which the user can then use with pyDataverse. This is also very in line with the "common" use case for signed URLs, but this time the role of pyDataverse is flipped: Instead of providing the signed URLs for other services as in Variant 1, it becomes the consumer of signed URLs.

sequenceDiagram
    participant User
    participant Admin/App
    participant pyDataverse
    participant Dataverse
    User->>Admin/App: Asks to perform action
    Admin/App->>Dataverse: Requests signed URL 
    Dataverse->>Admin/App: Issues signed URL
    Admin/App->>User: Forwards signed URL
    User->>pyDataverse: Provides signed URL
    pyDataverse->>Dataverse: Uses signed URL to perform action
    Dataverse->>pyDataverse: Responds
    pyDataverse->>User: Delivers result

Summary

In general I think the idea of using signed URLs for auth does not make too much sense on a per-request basis, at least not "transparently," that is when the result is to perform the action. It does make sense to support either or both of the other two use-cases though:

  • A user of pyDataverse might receive signed URLs via another app or administrator and wants to use those to perform actions with pyDataverse, e.g., scripting some data uploads.
  • A user of pyDataverse implements a service where pyDataverse is the backend and delivers URLs which can be used in a frontend without exposing the API key. In this case, pyDataverse would only generate the URLs and return those as a result, rather than performing the actual action.

The other variant, getting a signed URL and performing the action based on it, could then be naturally combined from Variants 1 and 3, but I do not really see special use cases for that.

What is your take on this?

shoeffner avatar Jul 28 '24 13:07 shoeffner

I started implementing version 2 and also the tests on my branch kind of assume that flow, but I realized that this is not really the way it should work.

For the other two variants, pyDataverse as "requester" and "consumer", I am not sure how good implementations would look like and what @JR-1991 is planning in #189 , but I feel this should be conceptually integrated when restructuring the API. Maybe some sort of middleware-y design might work to avoid having to implement everything twice, so that one can inject signed URL handling in variant 1, variant 3, and by injecting both variant 2 ;-) But I have to think about that a little bit, not sure what the best design is. Do you know how other tools solve this?

shoeffner avatar Jul 28 '24 13:07 shoeffner

I checked the boto3 docs and they seem to handle this in a very direct fashion, essentially instead of performing an action they change the function call to generate_presigned_url: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html and just state

The user can download the S3 object by entering the presigned URL in a browser. A program or HTML page can download the S3 object by using the presigned URL as part of an HTTP GET request.

They also only seem to implement variant 1, i.e., generating presigned URLs ; for variant 3 they basically say: use it with curl.

shoeffner avatar Jul 28 '24 13:07 shoeffner

I think event hooks might be a good idea: https://www.python-httpx.org/advanced/event-hooks/ For signing a URL one could rewrite the URL to the request a signed URL with a request hook, and extract the URL instead of the correct response with a response hook or something. I am not yet sure how that would look like on an API-level though, I can imagine any of the following options and there are probably way more:

# explicit args
api = DataAccessApi(URL, auth=..., middlewares=[SignedURLHook])

# wrapper
api = SignedURLs(DataAccessApi(URL, auth=...))
# alternative per method
api = DataAccessApi(URL, auth=...)
SignedURL(api.get_datafile(...))

# inheritance
class SignedURLDataAccessApi(DataAccessApi, SignedURLApi):
    pass
api = SignedURLDataAccessApi(URL, auth=...)

# extra method/property
api = DataAccessApi(URL, auth=...)
api.signed_url.get_datafile(...)

I am not sure how to do it the other way around, but similar ways might work, maybe with a signed_url_store which stores signed urls. Or some API which allows api.get(signed_url).[as_dataset()|as_file()|...], Dataset(api.get(...)), or Dataset.from(api.get(...)).

One option which might entail quite some refactoring would be to have a template per URL and match the template for signed URLs or realize them for non-signed URLs, depending on the use case.

shoeffner avatar Jul 29 '24 07:07 shoeffner

In the group meeting on August 21st, we decided to put this on hold until there is a need for it.

If you are in need of this feature, please re-open the issue.

shoeffner avatar Aug 24 '24 15:08 shoeffner