
Add an api to upload an image

Vichy97 opened this issue 1 year ago • 5 comments

Right now I need to do a lot of steps to upload from mobile. I need to:

  • generate a uuid
  • request a url from google url service with that uuid in the url
  • upload the blob to that url
  • make a mutation to include the tag for the area/climb I'm uploading to

This is a lot of steps, and it isn't very robust if any one of them fails. It will also be very difficult to support uploading images while offline. We should have a single mutation that does all of this.
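For context, that dance looks roughly like this from the client's side. A minimal sketch; the upload-service URL, the mutation name, and the input shape are placeholders, not the real OpenBeta API:

```ts
// Hypothetical sketch of the current multi-step client flow (all names are placeholders)
async function uploadPhoto (gqlEndpoint: string, blob: Blob, climbUuid: string): Promise<void> {
  // 1. generate a uuid for the new media
  const mediaUuid = crypto.randomUUID()

  // 2. ask the url service for a signed upload url that embeds the uuid
  const { uploadUrl, mediaUrl } = await (
    await fetch(`https://media.example.org/signed-url/${mediaUuid}`)
  ).json()

  // 3. push the blob straight to cloud storage
  await fetch(uploadUrl, { method: 'PUT', body: blob })

  // 4. finally, tell the GraphQL API to tag the new media against the climb
  await fetch(gqlEndpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: 'mutation Tag($input: MediaInput!) { addMedia(input: $input) { id } }',
      variables: { input: { mediaUrl, climbUuid } }
    })
  })
  // if any step fails part-way, the earlier steps are left dangling
}
```

Four round trips, four places to fail part-way, and nothing that retries cleanly while offline.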

Vichy97 · Jan 08 '25 16:01

I wouldn't mind cleaning up some of the flow to do with the media upload. @Vichy97 are you hoping to pass the blob through the gql layer and then have it resolve as an openbeta media object, or are you just wanting to see some of the steps reduced / the control flow made a bit more clear?

CocoisBuggy · Apr 08 '25 17:04

Ya I'd love to be able to just upload the blob and get the media object back. I don't think the client should have to generate the UUID and URLs or have to make multiple API calls.

Vichy97 · Apr 08 '25 17:04

References:

```mermaid
sequenceDiagram
    User->>Openbeta: Can I have a signed URL for uploading a file?
    Openbeta->>Openbeta: Check user auth
    Openbeta->>Google Cloud: Can I have a signed url?
    Google Cloud -->> Openbeta: Here is a signed url
    Openbeta-->>User: you are authenticated, so here is your url
    User->>Google Cloud: Here is the file
    Google Cloud-->>User: Okay
    User ->> Openbeta GQL: Okay, the media is created, here is the url.
    Openbeta GQL-->>User: Okay
```

This is a lot of moving parts, each holding its own state that may become orphaned if the user drops out at some point in the process. The suggested flow is something like:

```mermaid
sequenceDiagram
    User->>Openbeta GQL: Here is an image file
    Openbeta GQL->>Openbeta GQL: Check user auth
    Openbeta GQL->>Google Cloud: Here is a file for you
    Google Cloud -->> Openbeta GQL: Okay
    Openbeta GQL->> Openbeta GQL: Create entry in database
    Openbeta GQL -->>User: Here is the media object
```

This cuts down on the moving pieces quite significantly, as now the user only needs to make one request, but it is by no means a free lunch. The most vicious issue is that file upload is very heavy for our poor little GQL endpoint. In a kinder world our little JavaScript baby would be better at concurrent IO, but as it stands nodejs humbles our API a little bit. However, do not fret. There is a nearly-as-convenient construct we can use, which looks like this.

```mermaid
sequenceDiagram
   User->>Openbeta GQL: Here is an image file
   Openbeta GQL->>Openbeta GQL: Check user auth
   Openbeta GQL->>Google Cloud: Can I please have a signed url for my user?
   Google Cloud-->>Openbeta GQL: ofc bb 😘
   Openbeta GQL->> Openbeta GQL: Create entry in database for the pending media
   Openbeta GQL -->>User: Here is an un-reified media object, upload to <signed url>
   User->>Google Cloud: Here is the file
   Google Cloud-->>User: Okay
   Google Cloud->>Openbeta GQL: A new media object just got added with <filename>
   Openbeta GQL->>Openbeta GQL: Update the un-reified media object such that it is finalized
```

This doesn't look like an improvement, except that now the user only needs to make one request to the GQL endpoint, and is then responsible only for completing the upload that reifies the object; if they never do, it simply expires. In other words, when we hide the control flow that the user doesn't need to care about, we get

```mermaid
sequenceDiagram
   participant Google Cloud
   User->>Openbeta GQL: Here is an image file
   Openbeta GQL -->>User: Here is an un-reified media object, upload to <signed url>
   User->>Google Cloud: Here is a file
   Google Cloud-->>User: Okay
```
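For concreteness, the single GQL call could look something like the sketch below. The mutation name, the types, and the `dataSources.media` helpers are all assumptions for illustration, not the existing schema or code:

```ts
import gql from 'graphql-tag'
import { randomUUID } from 'node:crypto'

// Hypothetical schema addition; the names are invented for illustration only
export const typeDefs = gql`
  type PendingMedia {
    mediaId: ID!
    "Signed URL the client must PUT the file to before it expires"
    uploadUrl: String!
  }

  type Mutation {
    prepareMediaUpload(contentType: String!, entityUuid: ID!): PendingMedia!
  }
`

interface PendingMedia {
  mediaId: string
  uploadUrl: string
}

// Resolver skeleton: create the un-reified media entry and hand back a signed URL.
// `media.createSignedUploadUrl` and `media.insertPending` are assumed helpers, not existing code.
export const resolvers = {
  Mutation: {
    prepareMediaUpload: async (_: any, args: any, { dataSources, user }: any): Promise<PendingMedia> => {
      if (user == null) throw new Error('Not authenticated')

      const mediaId = randomUUID()
      // assumed wrapper around @google-cloud/storage's file.getSignedUrl({ action: 'write', ... })
      const uploadUrl = await dataSources.media.createSignedUploadUrl(mediaId, args.contentType)

      // persist the pending entry; the storage notification (or a TTL sweep) finalises or expires it later
      await dataSources.media.insertPending({ mediaId, entityUuid: args.entityUuid })

      return { mediaId, uploadUrl }
    }
  }
}
```

The client then PUTs the file to `uploadUrl` and is done; everything else happens between Google Cloud and the GQL server.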

Google Cloud will sadly not notify us when a signed URL expires, which leaves the pending reference stale. As a result, the mutation that creates new media for a user should clean out any stale items as well (although I will be looking at using MongoDB TTL indexes to clear these items automatically).
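A minimal sketch of the TTL idea, assuming a mongoose model holds the pending entries (field and collection names are placeholders):

```ts
import mongoose from 'mongoose'

// Pending (un-reified) media entries; `createdAt` drives the TTL
const PendingMediaSchema = new mongoose.Schema({
  mediaId: { type: String, required: true, index: true },
  entityUuid: { type: String, required: true },
  createdAt: { type: Date, default: Date.now }
})

// MongoDB deletes each document roughly `expireAfterSeconds` after its `createdAt`,
// so pending entries the client never uploads against simply disappear on their own.
PendingMediaSchema.index({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 })

export const PendingMediaModel = mongoose.model('pending_media', PendingMediaSchema)
```

Reifying the media would then just mean updating or removing the pending document when the storage notification arrives, and anything the client abandons ages out on its own.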

I want to mention #461, since it's on my todo as well

CocoisBuggy · Apr 09 '25 09:04

What is wrong with the second diagram? I'm not sure I understand the limitations there. The second diagram is basically exactly what I had in mind

Vichy97 · Apr 09 '25 14:04

> Apollo recommends handling the file upload itself out-of-band of your GraphQL server for the sake of simplicity and security. This "signed URL" approach allows your client to retrieve a URL from your GraphQL server (via S3 or other storage service) and upload the file to the URL rather than to your GraphQL server. [The blog post mentioned above goes into detail on how to set this up.](https://www.apollographql.com/blog/file-upload-best-)

There are a couple of reasons why Apollo says this, but I expect the chief one is that the GQL specification does not tell us what to do with large blobs. Apollo actually used to ship their server framework with native file upload support, but that has since reached end of life, and they don't seem at all keen to support the use case going forward. It's hard to provide robust support for file upload when the specification doesn't give us an easy option.

Summary

In practice I was observing that the cost of scaling this approach wouldn't be offset by the convenience it buys over uploading directly to Google (remember: if we upload as a proxy wrapper, we pay twice; once for the traffic into the GQL endpoint and then once again when we pay Google to receive our traffic).

Details

I want to mention that these tests run with the load being generated on a dedicated 8-core, 32 GB server, and the host machine is my standard development machine, which is spec'd the same. Neither machine is representative of our production environment, but I will only re-run these tests on a constrained machine if it becomes necessary.

Control test

Let me ground your expectations: if we write a resolver that does basically nothing, like this one

```ts
  // a no-op resolver, used to measure the raw overhead of the GQL layer itself
  nothing: async (_: any, args, { dataSources }: GQLContext): Promise<boolean> => {
    const { input }: { input: string } = args
    return true
  },
```

and then run a flaming-hot benchmark against it, sending a little string like "hello, openbeta" (a minute payload), then we can more or less measure the overhead of GQL itself. It doesn't do too badly in my view: if we use k6 to simulate 12,000 users making 200,000 requests (roughly 100 each), the node environment starts to take some real strain:

[Image: k6 results for the control test against the no-op resolver]
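The actual load script isn't shown here, but a k6 test of this shape looks roughly like the following (plain JavaScript, since k6 scripts run as JS; the endpoint address and query text are stand-ins):

```js
import http from 'k6/http'
import { check } from 'k6'

// 12,000 virtual users sharing 200,000 total iterations, per the description above
export const options = { vus: 12000, iterations: 200000 }

// assumes the no-op resolver is exposed as `nothing(input: String): Boolean`
const query = 'query Nothing($input: String) { nothing(input: $input) }'

export default function () {
  const res = http.post(
    'http://gql-host:4000/', // stand-in address for the Apollo server on the other machine
    JSON.stringify({ query, variables: { input: 'hello, openbeta' } }),
    { headers: { 'Content-Type': 'application/json' } }
  )
  check(res, { 'status is 200': (r) => r.status === 200 })
}
```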

It's just enough traffic that some requests start failing for no reason other than that the nodejs server cannot fulfil them. In a way, the nodejs runtime is designed for exactly this kind of concurrency, but we still see about 7% of requests failing for one reason or another. The data going back and forth travels over a local network (though the two machines are separate, to at least force the full network stack to step up), so we expect the performance to be good. And it is, relatively speaking; it's not impressive in the world of non-JS APIs (nginx yielded more than double the performance under the same test parameters), but it is still good.

Let me be clear: this is pretty good performance. It demonstrates that the Apollo layer has a nice low overhead and doesn't munch up all the available RAM trying to fulfil these thousands and thousands of requests. For our purposes this is a nice, aspirational throughput (3,600 requests a second), but this is where the challenging reality check comes in:

A test using a real endpoint

For this test I have selected the simple 'select climb by UUID' query, which doesn't carry much overhead aside from the database call. We expect this to hammer the computer MUCH harder, since it will be doing actual IO and more memory allocations per request.
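The query under test presumably has roughly this shape (the selection here is illustrative, not the exact one benchmarked):

```graphql
query ClimbByUuid($uuid: ID!) {
  climb(uuid: $uuid) {
    name
    # ...whatever other fields the benchmark selected
  }
}
```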

[Image: k6 results for the climb-by-UUID query test]

This is where things get real, since now we are doing a round trip to the database with every request and sending more data than before. Under the same conditions as above, this new test yields about a third of what we were getting before (1,100/s), which is still well within reason for us, though remember the host machine is not representative of a production environment.

A test using an upload mutation

The most generous upload mutation is the one I have implemented for this test:

```graphql
upload(input: String): Boolean
```

```ts
  upload: async (_: any, args, { dataSources }: GQLContext): Promise<boolean> => {
    const { input }: { input: string } = args
    // walk the payload once so the string is at least touched
    for (const x of input) {
      continue
    }
    return true
  },
```

We then send it a file containing some arbitrary random data, roughly the size of a 2 MB file (our current max is 11 MB). We pass the data as base64 to the API and pray to the lord above that neither machine explodes from the effort. If we run the load test with the above parameters we get a very predictable result: the machine generating the load immediately locks up as its RAM usage hits the ceiling. Oops.
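For reference, generating a payload like that is a one-liner, and it's worth remembering that base64 inflates the bytes by about a third before they ever reach the wire (this snippet is illustrative, not the actual test harness):

```ts
import { randomBytes } from 'node:crypto'

// ~1.8 MB of random bytes becomes ~2.4 MB of base64 text, which is what the
// GQL endpoint has to receive, parse out of the JSON body, and hold in memory.
const raw = randomBytes(Math.floor(1.8 * 1024 * 1024))
const input = raw.toString('base64')

console.log(`payload size on the wire: ${(input.length / 1024 / 1024).toFixed(2)} MB`)
```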

For time's and money's sake we are going to turn down the parameters: 100 users will try to upload 1,000 images between them (10 each on average). The results?

[Image: k6 results for the 100-user upload test]

Quite substantial data transport compared to the previous tests. The benchmark takes forever to run and moves about five and a half GB in the process. The requests per second? 2. On its own this is really not that big of a deal, since we are FAR away from ingesting 2 photos a second into the openbeta platform, but I'm sure you can see why this might be a cause for concern.

The current climb resolver can manage about 1k/s; the photo upload manages 2/s. You might be asking yourself at this point: "Does that really matter? All the other requests are quick, and the photo-uploading wrapper spends most of its time in the nodejs network stack, which is largely off the main thread."

You are quite right. Just to make sure we are not overlooking some optimization, we can truncate the resolver to simply ingest the data and nothing more:

```ts
  upload: async (_: any, args, { dataSources }: GQLContext): Promise<boolean> => true
```

In theory, this test should tell us nothing but the cost of taking in an image of ~1.8 MB at the endpoint. In the real world it's not a good test of the cost of the proposed use case, but if it comes out badly then we know this is not a great way to approach the problem.

[Image: k6 results for the ingest-only upload test]

Ouch. Those numbers are not good for our health. But okay, we still don't understand what this truly means for the average user: only 10% of image uploads failed, so what is happening to users making normal requests to the db at the same time? The event loop in the nodejs runtime should let these two unconnected workloads run without stepping on each other's toes (even if everything gets a bit slower).

That's true! If we run the load produced by people uploading things alongside the load of normal database requests, we get this:

[Image: k6 results for the combined upload and database load test]

Hey! The error rate is barely any different from the initial test, and we're still doing 800 requests a second in spite of the slow photo-ingestion load! Maybe this is viable after all!

Conclusion

We are in a situation where our development time is donated, while our compute time still costs the same very real currency that it does for any other software company. There are risks and pitfalls to wrapping the upload; what I have not mentioned in the above tests is just how much memory it takes to facilitate this kind of throughput 🧠

In my opinion, the flow I proposed is not too bad. The application developer makes two requests, and they flow together pretty nicely. The dev doesn't even need to tell the server how the upload went; if it failed, they can safely let the unverified object expire.

What are your thoughts? Is the convenience worth it?

CocoisBuggy · Apr 09 '25 19:04