graph-node
graph,graphql: Query.hash to represent a normalized query
Uses Query.hash as the cache key of #3759
The idea here is to avoid misses and increase the hit rate for the GraphQL validation's cache by normalizing GraphQL operations.
The normalization includes:
- sorting fields, fragments, arguments, fragment definitions, etc.
- making operation names identical
- transforming selection sets like { things } into query normalized { things }
- removing primitive values (like String, Int and Float, but not Boolean, because it can change the body of the operation when used with @include or @skip)
- removing aliases
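As a hedged illustration of the first three rules (a toy sketch, not graph-node's actual normalizer; the function and its inputs are hypothetical), two operations that differ only in name and field order can be reduced to identical text:

```rust
// Hypothetical toy normalizer illustrating the rules above: field order is
// sorted and the operation name is replaced with "normalized", so equivalent
// operations written differently produce the same string (and thus one cache key).
fn normalize(_operation_name: &str, fields: &[&str]) -> String {
    let mut fields = fields.to_vec();
    fields.sort_unstable();
    format!("query normalized {{ {} }}", fields.join(" "))
}

fn main() {
    let a = normalize("getThings", &["things", "owner"]);
    let b = normalize("fetchThings", &["owner", "things"]);
    // Both become: query normalized { owner things }
    assert_eq!(a, b);
}
```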
I just had a brief look at that, and it looks like it's mostly rewriting the query which requires allocations; is there a way to write this as a visitor without allocations that passes a Hasher around and just hashes things according to what's in the query, e.g., when it sees an Int, it adds a 0 to the hasher?
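The visitor idea being suggested could look roughly like this (a minimal sketch with a hypothetical two-variant AST standing in for graphql_parser's types, using std's DefaultHasher): walk the query and feed the hasher directly, never rewriting or cloning the query itself.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical mini-AST, not graph-node's real types.
enum Value {
    Int(i64),
    Boolean(bool),
}

struct Field {
    name: String,
    argument: Option<Value>,
}

// The visitor hashes what it sees; no part of the query is rewritten.
fn hash_field<H: Hasher>(field: &Field, hasher: &mut H) {
    field.name.hash(hasher);
    match &field.argument {
        // Primitive values collapse to a placeholder, so `a(x: 1)` and
        // `a(x: 2)` hash identically...
        Some(Value::Int(_)) => 0u8.hash(hasher),
        // ...but Booleans keep their value, since they can change the shape
        // of the operation via @include / @skip.
        Some(Value::Boolean(b)) => b.hash(hasher),
        None => 1u8.hash(hasher),
    }
}

fn main() {
    let mut h1 = DefaultHasher::new();
    hash_field(&Field { name: "a".into(), argument: Some(Value::Int(1)) }, &mut h1);
    let mut h2 = DefaultHasher::new();
    hash_field(&Field { name: "a".into(), argument: Some(Value::Int(2)) }, &mut h2);
    // Different Int values, same hash.
    assert_eq!(h1.finish(), h2.finish());
}
```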
I'd need to look more into it, but it should be possible to (ab)use stable_hash and also avoid sorting by using its unordered() functionality.
@That3Percent I just looked at the StableHash docs, and can't quite figure out the right way to do this with its API. Could you comment here how we could use it to avoid sorting? I.e., what's the right way to add a list of things to StableHash so that order of the things does not affect the hash?
@lutter
> I just looked at the StableHash docs, and can't quite figure out the right way to do this with its API. Could you comment here how we could use it to avoid sorting?
I agree it is generally a good idea to avoid clones and sorting. I didn't carefully check this particular case, but a nested transform/clone loop risks O(N²) complexity, since the innermost items can be cloned once per level of nesting depth.
There's a private API that we could expose for this purpose. The private API is here: https://github.com/graphprotocol/stable-hash/blob/7a3a2969e05a3e373dd53ec41786ad52b3887d17/src/impls/mod.rs#L13 and looks like this:
pub(self) fn unordered_unique_stable_hash<H: StableHasher>(
    items: impl Iterator<Item = impl StableHash>,
    field_address: H::Addr,
    state: &mut H,
) { ... }
Would you like for stable-hash to expose a wrapper like so...?
struct Unordered<T>(T);
impl<T, I> StableHash for Unordered<T> where T: IntoIterator<Item = I>, I: StableHash { .. }
Then you could do something like...
Unordered(&self.selection_set).stable_hash(field_address, state);
Within a StableHash impl for a newtype like NormalizedQuery(&Query)
I think I would technically have to make a major version bump for this, because nothing documents that StableHasher needs to support multi-sets. The major version bump wouldn't affect the API, so graph-node could do a very simple upgrade.
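To make the unordered() idea concrete, here is a std-only sketch of order-independent (multiset) hashing; this is an illustration of the concept, not stable-hash's actual algorithm: hash each item in isolation, then combine the per-item hashes with a commutative operation so iteration order cannot affect the result.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative multiset hash: each item is hashed independently and the
// per-item hashes are combined with wrapping addition, which is commutative,
// so no sorting is needed for a stable result.
fn unordered_hash<T: Hash>(items: impl Iterator<Item = T>) -> u64 {
    items
        .map(|item| {
            let mut h = DefaultHasher::new();
            item.hash(&mut h);
            h.finish()
        })
        .fold(0u64, |acc, h| acc.wrapping_add(h))
}

fn main() {
    let a = unordered_hash(["x", "y", "z"].into_iter());
    let b = unordered_hash(["z", "x", "y"].into_iter());
    // Order does not affect the hash.
    assert_eq!(a, b);
}
```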
@lutter @kamilkisiela
It's difficult to reconcile these two:
- Use stable-hash to avoid cloning
- Level 2 hash required for security
stable-hash gives you a choice of either a Level 1 hash or a Level 5 hash, but not a Level 2. If you choose the Level 5 hash, it comes with a severe performance penalty. I'm not confident I can make a Level 2 implementation of StableHasher. I've got some ideas of how I might go about it, but I don't want our security to rest on unverified ideas.
I might instead suggest sorting but not cloning. For example, you can buffer references to parts of the query using second-stack very cheaply and then sort the references. You can pass the hasher in and feed it pieces of the query as you go, rather than ever calling to_string. I think you can make this quite fast, but it will take some care. Let me know if the approach I'm describing is not clear; I can help if needed as well.
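A std-only sketch of "sort references, not clones" (here a plain Vec holds the references as scratch space; second-stack would avoid even that heap allocation): the original data is never cloned, only borrowed, sorted, and fed to the hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Collect references into scratch space, sort the references, and hash in
// sorted order. The definitions themselves are never cloned.
fn hash_sorted<H: Hasher>(definitions: &[String], hasher: &mut H) {
    let mut refs: Vec<&String> = definitions.iter().collect();
    refs.sort_unstable();
    for d in refs {
        d.hash(hasher);
    }
}

fn main() {
    let a = vec!["b".to_string(), "a".to_string()];
    let b = vec!["a".to_string(), "b".to_string()];
    let (mut h1, mut h2) = (DefaultHasher::new(), DefaultHasher::new());
    hash_sorted(&a, &mut h1);
    hash_sorted(&b, &mut h2);
    // Input order does not matter once the references are sorted.
    assert_eq!(h1.finish(), h2.finish());
}
```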
@lutter @kamilkisiela
For a Level 2 hash function, I'd recommend this one: XXH3 128 with secret using a 192 byte secret initialized as a lazy_static. It operates at RAM speeds and seemingly has no weaknesses.
@lutter I renamed these:
- query_hash
+ validation_hash
- Query.hash
+ Query.validation_hash
and took the iteration logic (visitor) from the shape_hash function.
@That3Percent I tried to use second-stack, but I'm not sure if I'm using it correctly.
It feels wrong to wrap everything with a buffer. I tried to return the slice from the buffer function, but I'm not sure if that's correct.
Before
let mut next_definitions = self.definitions.clone();
next_definitions.sort_unstable_by(|a, b| compare_definitions(a, b));
for defn in &next_definitions {
    match defn {
        q::Definition::Operation(operation) => operation.query_validation_hash(hasher),
        q::Definition::Fragment(fragment) => fragment.query_validation_hash(hasher),
    }
}
After
buffer(self.definitions.iter(), |slice| {
    slice.sort_unstable_by(|a, b| compare_definitions(a, b));
    for defn in slice {
        match defn {
            q::Definition::Operation(operation) => operation.query_validation_hash(hasher),
            q::Definition::Fragment(fragment) => fragment.query_validation_hash(hasher),
        }
    }
});
@kamilkisiela
> I tried to use second-stack but I'm not sure if I use it correctly. It feels wrong to wrap everything with a buffer. I tried to return the slice in the buffer function, not sure if that's correct.
The usage pattern in "After" is correct. The memory is freed once the buffer function returns, so it's not possible to return the slice. Wrapping in buffer is the only way to provide this fast API in safe Rust because we have to guarantee that buffers are freed in the reverse order they are created.
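The lifetime constraint can be seen in a hypothetical std-only analogue of such a buffer function (the real second-stack reuses one thread-local stack rather than allocating a Vec per call): the scratch memory lives only as long as the closure, which is exactly why a slice borrowed from it cannot be returned.

```rust
// Hypothetical minimal analogue of a scoped `buffer` API: the caller gets
// the slice only inside the closure, and the closure's return value is the
// only thing that escapes.
fn buffer<T, R>(items: impl Iterator<Item = T>, f: impl FnOnce(&mut [T]) -> R) -> R {
    let mut scratch: Vec<T> = items.collect();
    f(&mut scratch)
    // `scratch` is dropped here; any borrow of it must already have ended.
}

fn main() {
    let total = buffer([3, 1, 2].into_iter(), |slice| {
        slice.sort_unstable();
        // Return an owned value, not the borrowed slice.
        slice.iter().sum::<i32>()
    });
    assert_eq!(total, 6);
}
```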
Ready for a review
Closing for now, we'll restart this if needed.