[GraphQL] Add a mutation payload size limit
Description
The mutation payload can be much larger than a query payload, due to the possibility of passing large transaction data (e.g., publishing a package). This PR adds a separate check for when a mutation is requested, and adds a `max_tx_payload_size` variable that holds the maximum number of bytes that can be sent through a GraphQL mutation request.
The sum of `txBytes + signatures` in a GraphQL mutation, or `txBytes` in a `dryRunTransactionBlock` query, has to be below `max_tx_payload_size`.
The `max_tx_payload_size` is computed based on the protocol config (`protocol_version -> max_tx_bytes`) and a Base64 overhead, as follows:
`max_tx_bytes * 4 / 3`
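For illustration, a minimal sketch of how that computation could look (the function name is hypothetical, not the PR's actual code):

```rust
/// Hypothetical helper: derive the mutation payload limit from the
/// protocol config's `max_tx_bytes`. Base64 encodes every 3 input bytes
/// as 4 output characters, so the encoded transaction grows by 4/3.
fn max_tx_payload_size(max_tx_bytes: u32) -> u32 {
    // Round up to a whole 4-character Base64 group.
    max_tx_bytes.div_ceil(3) * 4
}
```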
Test plan
Added several tests.
```sh
cd crates/sui-graphql-rpc
cargo nextest run --features pg_integration -- test_transaction_dry_run
cargo nextest run --features pg_integration -- test_query_mutation
```
Release notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.
- [ ] Protocol:
- [ ] Nodes (Validators and Full nodes):
- [ ] Indexer:
- [ ] JSON-RPC:
- [x] GraphQL:
Added a `max_tx_payload_size` variable to protect against large transaction queries. The sum of `txBytes + signatures` in a GraphQL `executeTransactionBlock` mutation, or `txBytes` in a `dryRunTransactionBlock` query, has to be below `max_tx_payload_size`. The `max_tx_payload_size` is computed based on the protocol config (`protocol_version -> max_tx_bytes`) and a Base64 overhead, as follows: `max_tx_bytes * 4 / 3`. Also added a check that the overall query size is not larger than `max_tx_payload_size + max_query_payload_size`, where `max_query_payload_size` covers the read part of the query.
- [ ] CLI:
- [ ] Rust SDK:
@amnn I'm not quite sure how to make the default function work correctly when a `max_query_payload_size` is set in the config toml file, as this changes the mutation payload size value.
Use cases:
- 🟢 No config toml: everything is loaded from defaults, all good.
- 🔴 Query payload size passed in the config toml: the max mutation payload will not be correct. We need a way to update it if the query payload size is set and not defaulted, but we cannot sum up the different values in the impl function
```rust
async fn max_mutation_payload_size(&self) -> u32 {
    // cannot also add: + self.limits.max_query_payload_size + signature_payload_size
    self.limits.max_mutation_payload_size
}
```
because of the use case below: if one sets the mutation payload size in the config toml, its value should already include all payloads, so we might double count.
- 🟡 Query payload size and mutation payload size both passed: all good, as the mutation payload should contain both. However, I see how this field might be a bit confusing: in this case it should include all three payload sizes (max allowed tx bytes + Base64 overhead + query payload size + signature size), while in the previous case it should only include the maximum tx bytes allowed + Base64 overhead. So yeah, it might have hiccups.
Re: your question about interaction between mutation size and query payload size. I think your current approach (to treat the two as "independent" variables) is best (least confusing).
Yes, but the current code will not be correct if a configuration has the `max_query_payload_size` variable set, but not the mutation payload variable. The mutation payload variable needs to take that value into account, but only if it is set manually. This should happen either in the default function or somewhere else after the config is read and initialized, to ensure its value is correct.
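One hypothetical shape for that post-initialization fix-up (struct and field names are assumed for illustration; as the thread below notes, the PR ultimately keeps the two limits independent):

```rust
#[derive(Clone)]
struct Limits {
    max_query_payload_size: u32,
    max_mutation_payload_size: u32,
}

impl Limits {
    /// Hypothetical hook, run after the config toml is read: if the query
    /// payload limit was overridden but the mutation limit was left at its
    /// default, grow the mutation limit by the same amount so it still
    /// covers the read part of a mutation.
    fn finalize(&mut self, defaults: &Limits) {
        if self.max_query_payload_size != defaults.max_query_payload_size
            && self.max_mutation_payload_size == defaults.max_mutation_payload_size
        {
            let extra = self
                .max_query_payload_size
                .saturating_sub(defaults.max_query_payload_size);
            self.max_mutation_payload_size += extra;
        }
    }
}
```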
> Yes, but the current code will not be correct if a configuration has the `max_query_payload_size` variable set, but not the mutation payload variable. The mutation payload variable needs to take that value into account, but only if it is set manually, and not if it is not set....
You are saying that if only the `max_query_payload_size` is set, then we aren't bumping up the default `max_mutation_payload_size` accordingly? If so, my position is that it's okay, and that's part of understanding the limits being changed.
Correct. We don't have very good docs on these config limits yet; maybe adding those will help alleviate problems. I hope, though, that most RPCs use the default limits or ones greater than the defaults, as otherwise it might create some issues if folks switch (often) between RPCs.
Found out that the protocol config `max_tx_bytes` actually includes the signatures, so we don't need to check for them separately. So we have two options:
- separate tx bytes and signatures, and consider having signatures as part of the max query payload size?
- separate tx bytes and signatures, and have some constraints on how many signatures are allowed?
@amnn some more development here, I think it's almost there. I need to figure out one more thing: how to match variables' names with `txBytes` and `signatures`, and ignore the rest of the defined variables, when counting the mutation payload bytes.
I think I found a bug: `query.len()` does not resolve the variables in the query. Shouldn't the read query payload be evaluated after the variables are resolved in the query? Basically, the fix is simple: just call stringify on the doc + variables. Not sure how it works with fragments; I'll have to poke at it tomorrow.
> Found out that the protocol config `max_tx_bytes` actually includes the signatures, so we don't need to check for them separately.
Okay, that's good to know -- I don't think I understand the options you're suggesting, though (to separate the two fields, or to treat signatures as part of the query payload instead of the mutation payload). To me, this information indicates that when we set the default limit, we just don't add the component that was being added specially for signatures; and when we remove elements to check that the query payload, less the transaction payload, is still within the bounds of the query payload limit, we still remove the signatures, as was originally the plan.
> I think I found a bug: `query.len()` does not resolve the variables in the query. Shouldn't the read query payload be evaluated after the variables are resolved in the query? Basically, the fix is simple: just call stringify on the doc + variables.
Nice find. I think it is indeed a bug not to measure the footprint of the variables, but I don't think stringifying after resolution is the right move, because:
- It may end up charging for the size of a variable's value however many times it was used (we want to measure the cost exactly on ingress, i.e., the cost of the variables JSON object plus the cost of the query JSON object).
- If someone is trying to OOM the service, stringifying a large query in order to test its size makes it easier for them to do that, because we at least double the memory footprint of a large query. We want to perform this size check before we start performing operations that are proportional to the size of the query.
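A sketch of that ingress measurement (names assumed; requires serde_json's `raw_value` feature), charging for the raw query text and the raw variables JSON exactly once each, without re-stringifying anything:

```rust
use serde_json::value::RawValue;

/// Sketch: a request's footprint on ingress is the length of the query
/// text plus the length of the (unparsed) variables JSON object. Each is
/// counted once, no matter how many times a variable is referenced.
fn ingress_payload_size(query: &str, variables: Option<&RawValue>) -> usize {
    query.len() + variables.map_or(0, |v| v.get().len())
}
```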
One thought I had (not fully fleshed out) was to implement this test (different limits for queries and mutations, but the query limit also applies to the query part of the mutation) in two steps:
- At the query limit checker level, you do the coarse check of overall footprint. This applies a different limit by operation, and it should take the size of variables into account as well, but it doesn't do the additional query limit check for mutations.
- The end-points that are responsible for accepting transactions do the additional check, before issuing the transaction, that if you take out the transaction bytes and signatures, the remaining size is within the query limit (this requires something to stash the overall query size to debit the transaction/signature bytes from).
Having said this though, I've spotted a couple of potential pitfalls:
- `dryRunTransactionBlock` is a query, but it needs to have the same limits applied to it.
- You can certainly have multiple dry-runs in a single request, and you can possibly issue multiple transaction executions too, so the strategy of each one subtracting its transaction size and checking the remaining size against limits may not work, because the "remaining size" includes the footprint of other transactions/signatures.
I don't have great solutions for these pitfalls, but wanted to flag them.
Thanks for your thoughts @amnn. I will revisit this logic and try to cover all the cases.
> The end-points that are responsible for accepting transactions do the additional check
I don't follow why this is better at the endpoints than in the query limits checker logic. I think we can iterate through the nodes in the mutation and collect all the `executeTransactionBlock` nodes, so that we can check whether the `txBytes` for each `executeTransactionBlock` is within accepted limits, and whether the overall read part of the whole query is within limits.
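A rough sketch of that traversal over async-graphql's parsed document (illustrative only; it ignores fragment spreads and nested selections, which the real check would need to handle):

```rust
use async_graphql::parser::types::{
    DocumentOperations, ExecutableDocument, OperationType, Selection,
};

/// Count the top-level `executeTransactionBlock` fields of every mutation
/// in the document, so their `txBytes` arguments can be checked against
/// the transaction payload limit.
fn execute_tx_block_count(doc: &ExecutableDocument) -> usize {
    let operations: Vec<_> = match &doc.operations {
        DocumentOperations::Single(op) => vec![op],
        DocumentOperations::Multiple(ops) => ops.values().collect(),
    };

    operations
        .into_iter()
        .filter(|op| op.node.ty == OperationType::Mutation)
        .flat_map(|op| &op.node.selection_set.node.items)
        .filter(|sel| {
            matches!(
                &sel.node,
                Selection::Field(f) if f.node.name.node.as_str() == "executeTransactionBlock"
            )
        })
        .count()
}
```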
> I don't follow why this is better at the endpoints than in the query limits checker logic.
My thinking was that if you do the query payload check for mutations at that stage, you can take advantage of the fact that async-graphql will have resolved variables for you, done type checks, etc. Otherwise, the checks that you add yourself will have to repeat the same work.
The downside is that it distributes parts of essentially the same check across multiple files, which is not ideal. Maybe there's a middle ground where it's added to the query limits checker, but at a different stage (on a different extension hook)?
Ah thanks, now that makes sense! Ideally, yes, we should keep it all in the query limits checker. Let me look around!
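For reference, async-graphql extensions expose multiple hooks, so both parts of the check could plausibly live in a single extension. A sketch, with the actual limit checks elided:

```rust
use async_graphql::{
    extensions::{Extension, ExtensionContext, NextParseQuery, NextValidation},
    parser::types::ExecutableDocument,
    ServerError, ServerResult, ValidationResult, Variables,
};

struct PayloadLimits;

#[async_trait::async_trait]
impl Extension for PayloadLimits {
    /// Early hook: the coarse overall-size check, run before the document
    /// is parsed, with access to the raw query text and variables.
    async fn parse_query(
        &self,
        ctx: &ExtensionContext<'_>,
        query: &str,
        variables: &Variables,
        next: NextParseQuery<'_>,
    ) -> ServerResult<ExecutableDocument> {
        // ... check query + variables footprint against the per-operation limit ...
        next.run(ctx, query, variables).await
    }

    /// Later hook: after parsing, walk the document to apply the extra
    /// query-payload check for mutations.
    async fn validation(
        &self,
        ctx: &ExtensionContext<'_>,
        next: NextValidation<'_>,
    ) -> Result<ValidationResult, Vec<ServerError>> {
        // ... subtract transaction bytes and re-check the read part ...
        next.run(ctx).await
    }
}
```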
@amnn I think I got the mutation + read part in a mutation scenario correct now. What's remaining is:
- how to handle variables and their sizes if they are not strings (e.g., passing filters). We could just transform everything to a string and then do a `.len()` on that, given that that's how we were checking query lengths. I tried something with transforming variables into JSON objects and extracting their bytes, but I don't think that's the way to go.
- decide if signatures should be included as part of the read query payload or not. As of now, they're excluded.
- update all the doc comments to reflect exactly what each payload is measuring.
Will add a comprehensive test today to ensure correctness in different cases, and after that I'll ping you @amnn to see if you have any remaining comments.
I think we could use the `RequestBodyLimit` layer from tower-http and add it to our route builder, but unfortunately in this version it requires the request to be of type `http_body::Limited<hyper::body::Body>`, which brings a bunch of other problems.
I tried using `DefaultBodyLimit`, but I think that does not get called for all requests, only in extractors that implement `FromRequest` and explicitly call another function to set this limit.
I also tried updating axum and hyper but it gets gnarly, so we'll set that aside for a moment.
I was thinking we could use the `Content-Length` header, which holds the body length; we can do this initial check either in a middleware layer or directly in the `graphql_handler`.
Thanks for looking into this @stefan-mysten -- echoing our conversation elsewhere: using `Content-Length` is a great idea -- and doing that by adding middleware to capture the content length, but still performing the check itself in the GraphQL extension, sounds like the right approach to me!
We probably don't need middleware to capture the content length - we can just use the `TypedHeader` extractor in the GraphQL handler and store it in the data context. I have made the changes but haven't yet dealt with how to handle variables correctly. Will push them today!
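A sketch of that handler (assuming axum 0.6 with the `headers` feature; `SuiSchema`, `Query`, and `PayloadSize` are stand-in names, not the PR's actual types):

```rust
use async_graphql::{EmptyMutation, EmptySubscription, Object, Schema};
use axum::{Extension, TypedHeader};
use headers::ContentLength;

struct Query;

#[Object]
impl Query {
    async fn ping(&self) -> &str {
        "pong"
    }
}

type SuiSchema = Schema<Query, EmptyMutation, EmptySubscription>;

/// Stand-in newtype for stashing the declared body size in the GraphQL
/// data context, where a limits extension can read it before parsing.
struct PayloadSize(u64);

async fn graphql_handler(
    content_length: Option<TypedHeader<ContentLength>>,
    Extension(schema): Extension<SuiSchema>,
    req: async_graphql_axum::GraphQLRequest,
) -> async_graphql_axum::GraphQLResponse {
    let mut req = req.into_inner();
    if let Some(TypedHeader(ContentLength(len))) = content_length {
        // Make the Content-Length value available to extensions.
        req = req.data(PayloadSize(len));
    }
    schema.execute(req).await.into()
}
```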
Refactored the code. This is hands down the worst code I ever wrote 😭
@amnn I addressed all comments except the "more tests" one. Will try to get to it in the morning, but I might not be able to.
Ended up making quite a few changes/refactors. Detailed in the commit message from me, but I'll copy it here for reference:
- Move the overall (initial) payload size check to happen before the document is parsed. Doing it after defeats the point of the check, because we end up doing work proportional to the size we're checking before we check it.
- Similarly, avoid copies as another example of work that is proportional to the size that we are checking, before we complete the check.
- Perform the transaction payload check after the inputs test. The former is implemented recursively, which means if it runs before the input nodes check (which limits query depth), it could stack overflow.
- Incorporate this check into
LimitsTraversal, rather than as a separate recursive function. This allows it to take advantage of the context that it carries around, rather than passing multiple parameters down recursively. - Removed the
query_payload_too_large_sizemetric, because it uses the same histogram buckets as thequery_payload_sizemetric which we are already logging to, so we don't gain any fidelity by tracking it separately. - Fixed handling of variables, which was not taking into account variables that could show up embedded within other GraphQL expressions.
- Standardise names of tests so they are easier to search, and to selectively run in the test runner.
- Implement tests that use variables but are not expected to pass limits checks in a way that does not require running the test cluster. This way they are faster to run, and avoid issues related to hanging indefinitely.
The additional tests are to come, still.
> The additional tests are to come, still.
Tests now added, this PR should be good to go now.
Thanks @amnn and @wlmyng for finishing this PR! 🙇