
[Experiment] Generic bulk queries in GraphQL


Public docs at https://help.shopify.com/en/api/guides/bulk-operations/, but this is still experimental/beta.

At Shopify we've seen a lot of clients who need to fetch large quantities of data (e.g. all 10k products on a shop). We follow relay pagination, so typically they would do this via paginated requests (e.g. 100 requests of 100 products each), but this is inefficient: it means a lot of round-trips, a lot of query-processing overhead, concatenation logic on the client, etc.
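
For context, each page in that approach is a relay-style connection query along these lines (the client loops, passing pageInfo.endCursor back in as after until hasNextPage is false); field names here are illustrative:

query ($cursor: String) {
  products(first: 100, after: $cursor) {
    edges {
      node {
        id
        title
      }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}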

As a better solution, we're experimenting with a subtle variant of our GraphQL schema (or perhaps more accurately, a variant of the relay pagination spec) where clients don't specify any of the regular first/last/before/after arguments. Instead, our server iterates over the pages of the query transparently and returns the full result set (e.g. all 10k products) in a single file downloadable from S3 or a similar storage host. Details are at the link above.
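
A sketch of the flow (the exact field names are part of the experimental API and shouldn't be read as stable): the client submits the connection query with no pagination arguments, then checks back until the download URL for the file is available.

# Submitted as a bulk operation: no first/last/before/after arguments
query AllProducts {
  products {
    nodes {
      id
      title
    }
  }
}

# Polled afterwards until the downloadable file is ready (field names approximate)
query CheckBulkOperation {
  currentBulkOperation {
    id
    status
    url
  }
}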

As a general pattern I think this might be useful, or at least interesting, to the broader community, which is why I wanted to share it. There was some interesting conversation in the working group meeting over whether this is technically spec-compliant or not, and I'm also interested (if it works out long-term) in potentially working it into the spec. In my mind it's similar to subscriptions, in that it's another variant on how the data actually gets returned, built on top of the same fundamental schema (normal queries are synchronous pull, subscriptions are synchronous push, and this bulk API is asynchronous pull).

We've also talked about variations of this approach that are more explicitly not spec-compliant, e.g. a separate endpoint for these requests that takes the GraphQL request in the normal way but returns only the ID of the job that was launched instead of any spec-compliant response.
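
For illustration, such an endpoint might respond with just the job handle rather than a spec-shaped data/errors envelope; the exact shape here is hypothetical:

{
  "bulkOperation": {
    "id": "gid://shopify/BulkOperation/720918",
    "status": "CREATED"
  }
}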

Whatever thoughts or opinions you have, please share. We're obviously paying a lot of attention to what our clients think of this, but we also want to be good citizens if there is interest or concern from the GraphQL community.

eapache avatar Sep 12 '19 18:09 eapache

Subscriptions could be an acceptable way to make this work. The subscription can return an initial payload with some info about the size and expected time, and the client stays subscribed to receive the S3 URL once the batch job is done.
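
As a rough sketch of that flow (all names here are hypothetical, not an existing API), the first subscription event could carry the size/timing info and a later event the S3 URL:

subscription ($id: ID!) {
  bulkOperationProgress(id: $id) {
    status       # initial event: CREATED / RUNNING
    objectCount  # rough size information available up front
    url          # populated on the final event once the file is on S3
  }
}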

wtrocki avatar Sep 17 '19 22:09 wtrocki

Subscriptions would have one big advantage, in that you could use a normal graphql selection set to describe the query instead of using a string. For example:

mutation {
  bulkOperationRunQuery(
   query: """
    {
      products {
        nodes {
          id
          title
        }
      }
    }
    """
  ) {
    bulkOperation {
      id
      status
    }
  }
}

turns into

subscription {
  bulkOperation {
    products {
      nodes {
        id
        title
      }
    }
  }
}

Writing the query would be much easier because you can leverage the existing tooling. The downside is that communicating the job information to the client doesn't use a normal graphql selection set. You could still do it by sending the information in the extensions field, but it's not as discoverable. For example:

{
  "data": null,
  "extensions": {
    "bulkOperation": {
      "id": "gid:\/\/shopify\/BulkOperation\/720918",
      "status": "CREATED"
    }
  }
}

You might be able to solve that by letting the client specify where the data should go as an argument to the subscription. For example:

subscription {
  bulkOperation(destination: "s3://my-bucket/my-key") {
    products {
      nodes {
        id
        title
      }
    }
  }
}

or even have the destination be a url that receives a webhook with the bulk operation url.
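
That variant could reuse the same argument with a callback URL as the value (the URL here is just a placeholder):

subscription {
  bulkOperation(destination: "https://example.com/webhooks/bulk-operation-complete") {
    products {
      nodes {
        id
        title
      }
    }
  }
}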

dwwoelfel avatar Sep 20 '19 16:09 dwwoelfel

Closing as stale (related to #1413, though this isn't technically an action item)

I should note that this is one of the things that I'd like to use @stream for, which is why I'm pushing for it to be doable with minimal memory overhead (i.e. no complex tracking of previously delivered data).
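
For reference, with the @stream directive from the incremental delivery proposal the same fetch could look something like this, with each list item delivered as its own incremental payload so the server never has to buffer or track the full result:

query {
  products {
    nodes @stream(initialCount: 0) {
      id
      title
    }
  }
}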

benjie avatar Nov 10 '23 12:11 benjie