graph-node
graph,graphql: Query.hash to represent a normalized query
Uses Query.hash as the cache key of #3759
The idea here is to avoid misses and increase the hit rate for the GraphQL validation's cache by normalizing GraphQL operations.
The normalization includes:
- sorting fields, fragments, arguments, fragment definitions, etc.
- making operation names identical
- transforming selection sets like { things } into query normalized { things }
- removing primitive values (like String, Int and Float, but not Boolean, because it can change the body of the operation when used with @include or @skip)
- removing aliases
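As a hedged illustration of the first three rules (a toy sketch, not graph-node's actual normalizer; the function and its inputs are hypothetical), two operations that differ only in name and field order can be reduced to identical text:

```rust
// Hypothetical toy normalizer illustrating the rules above: field order is
// sorted and the operation name is replaced with "normalized", so equivalent
// operations written differently produce the same string (and thus one cache key).
fn normalize(_operation_name: &str, fields: &[&str]) -> String {
    let mut fields = fields.to_vec();
    fields.sort_unstable();
    format!("query normalized {{ {} }}", fields.join(" "))
}

fn main() {
    let a = normalize("getThings", &["things", "owner"]);
    let b = normalize("fetchThings", &["owner", "things"]);
    // Both become: query normalized { owner things }
    assert_eq!(a, b);
}
```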
I just had a brief look at that, and it looks like it's mostly rewriting the query which requires allocations; is there a way to write this as a visitor without allocations that passes a Hasher around and just hashes things according to what's in the query, e.g., when it sees an Int, it adds a 0 to the hasher?
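The visitor idea being suggested could look roughly like this (a minimal sketch with a hypothetical two-variant AST standing in for graphql_parser's types, using std's DefaultHasher): walk the query and feed the hasher directly, never rewriting or cloning the query itself.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical mini-AST, not graph-node's real types.
enum Value {
    Int(i64),
    Boolean(bool),
}

struct Field {
    name: String,
    argument: Option<Value>,
}

// The visitor hashes what it sees; no part of the query is rewritten.
fn hash_field<H: Hasher>(field: &Field, hasher: &mut H) {
    field.name.hash(hasher);
    match &field.argument {
        // Primitive values collapse to a placeholder, so `a(x: 1)` and
        // `a(x: 2)` hash identically...
        Some(Value::Int(_)) => 0u8.hash(hasher),
        // ...but Booleans keep their value, since they can change the shape
        // of the operation via @include / @skip.
        Some(Value::Boolean(b)) => b.hash(hasher),
        None => 1u8.hash(hasher),
    }
}

fn main() {
    let mut h1 = DefaultHasher::new();
    hash_field(&Field { name: "a".into(), argument: Some(Value::Int(1)) }, &mut h1);
    let mut h2 = DefaultHasher::new();
    hash_field(&Field { name: "a".into(), argument: Some(Value::Int(2)) }, &mut h2);
    // Different Int values, same hash.
    assert_eq!(h1.finish(), h2.finish());
}
```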
I'd need to look more into it, but it should be possible to (ab)use stable_hash and also avoid sorting by using its unordered() functionality.
@That3Percent I just looked at the StableHash docs, and can't quite figure out the right way to do this with its API. Could you comment here how we could use it to avoid sorting? I.e., what's the right way to add a list of things to StableHash so that order of the things does not affect the hash?
@lutter
> I just looked at the StableHash docs, and can't quite figure out the right way to do this with its API. Could you comment here how we could use it to avoid sorting?
I agree it is generally a good idea to avoid clones and sorting. I didn't carefully check this particular case, but a nested transform/clone loop risks O(N²) complexity, since the innermost items can be cloned once per level of nesting depth.
There's a private API that we could expose for this purpose. The private API is here: https://github.com/graphprotocol/stable-hash/blob/7a3a2969e05a3e373dd53ec41786ad52b3887d17/src/impls/mod.rs#L13 and looks like this:
pub(self) fn unordered_unique_stable_hash<H: StableHasher>(
    items: impl Iterator<Item = impl StableHash>,
    field_address: H::Addr,
    state: &mut H,
) { ... }
Would you like for stable-hash to expose a wrapper like so...?
struct Unordered<T>(T);
impl<T, I> StableHash for Unordered<T> where T: IntoIterator<Item = I>, I: StableHash { .. }
Then you could do something like...
Unordered(&self.selection_set).stable_hash(field_address, state);
Within a StableHash impl for a newtype like NormalizedQuery(&Query)
I think I would technically have to make a major version bump for this, because nothing documents that StableHasher needs to support multi-sets. The major version bump wouldn't affect the API, so graph-node could do a very simple upgrade.
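To make the unordered() idea concrete, here is a std-only sketch of order-independent (multiset) hashing; this is an illustration of the concept, not stable-hash's actual algorithm: hash each item in isolation, then combine the per-item hashes with a commutative operation so iteration order cannot affect the result.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative multiset hash: each item is hashed independently and the
// per-item hashes are combined with wrapping addition, which is commutative,
// so no sorting is needed for a stable result.
fn unordered_hash<T: Hash>(items: impl Iterator<Item = T>) -> u64 {
    items
        .map(|item| {
            let mut h = DefaultHasher::new();
            item.hash(&mut h);
            h.finish()
        })
        .fold(0u64, |acc, h| acc.wrapping_add(h))
}

fn main() {
    let a = unordered_hash(["x", "y", "z"].into_iter());
    let b = unordered_hash(["z", "x", "y"].into_iter());
    // Order does not affect the hash.
    assert_eq!(a, b);
}
```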
@lutter @kamilkisiela
It's difficult to reconcile these two:
- Use stable-hash to avoid cloning
- Level 2 hash required for security
stable-hash gives you a choice of either a Level 1 hash or a Level 5 hash, but not a Level 2. If you choose the Level 5 hash, it comes with a severe performance penalty. I'm not confident I can make a Level 2 implementation of StableHasher. I've got some ideas of how I might go about it, but I don't want our security to rest on unverified ideas.
I might instead suggest sorting but not cloning. For example, you can buffer references to parts of the query using second-stack very cheaply and then sort the references. You can pass the hasher in and feed it pieces of the query as you go, rather than ever calling to_string. I think you can make this quite fast, but it will take some care. Let me know if the approach I'm describing is not clear; I can help if needed as well.
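A std-only sketch of "sort references, not clones" (here a plain Vec holds the references as scratch space; second-stack would avoid even that heap allocation): the original data is never cloned, only borrowed, sorted, and fed to the hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Collect references into scratch space, sort the references, and hash in
// sorted order. The definitions themselves are never cloned.
fn hash_sorted<H: Hasher>(definitions: &[String], hasher: &mut H) {
    let mut refs: Vec<&String> = definitions.iter().collect();
    refs.sort_unstable();
    for d in refs {
        d.hash(hasher);
    }
}

fn main() {
    let a = vec!["b".to_string(), "a".to_string()];
    let b = vec!["a".to_string(), "b".to_string()];
    let (mut h1, mut h2) = (DefaultHasher::new(), DefaultHasher::new());
    hash_sorted(&a, &mut h1);
    hash_sorted(&b, &mut h2);
    // Input order does not matter once the references are sorted.
    assert_eq!(h1.finish(), h2.finish());
}
```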
@lutter @kamilkisiela
For a Level 2 hash function, I'd recommend this one: XXH3 128 with secret using a 192 byte secret initialized as a lazy_static. It operates at RAM speeds and seemingly has no weaknesses.
@lutter I renamed these:
- query_hash
+ validation_hash
- Query.hash
+ Query.validation_hash
and took the iteration logic (visitor) from the shape_hash function.
@That3Percent I tried to use second-stack, but I'm not sure if I'm using it correctly.
It feels wrong to wrap everything with a buffer. I tried to return the slice from the buffer function, but I'm not sure if that's correct.
Before
let mut next_definitions = self.definitions.clone();
next_definitions.sort_unstable_by(|a, b| compare_definitions(a, b));
for defn in &next_definitions {
    match defn {
        q::Definition::Operation(operation) => operation.query_validation_hash(hasher),
        q::Definition::Fragment(fragment) => fragment.query_validation_hash(hasher),
    }
}
After
buffer(self.definitions.iter(), |slice| {
    slice.sort_unstable_by(|a, b| compare_definitions(a, b));
    for defn in slice {
        match defn {
            q::Definition::Operation(operation) => operation.query_validation_hash(hasher),
            q::Definition::Fragment(fragment) => fragment.query_validation_hash(hasher),
        }
    }
});
@kamilkisiela
> I tried to use second-stack but I'm not sure if I use it correctly. It feels wrong to wrap everything with a buffer. I tried to return the slice in the buffer function, not sure if that's correct.
The usage pattern in "After" is correct. The memory is freed once the buffer function returns, so it's not possible to return the slice. Wrapping in buffer is the only way to provide this fast API in safe Rust because we have to guarantee that buffers are freed in the reverse order they are created.
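The lifetime constraint can be seen in a hypothetical std-only analogue of such a buffer function (the real second-stack reuses one thread-local stack rather than allocating a Vec per call): the scratch memory lives only as long as the closure, which is exactly why a slice borrowed from it cannot be returned.

```rust
// Hypothetical minimal analogue of a scoped `buffer` API: the caller gets
// the slice only inside the closure, and the closure's return value is the
// only thing that escapes.
fn buffer<T, R>(items: impl Iterator<Item = T>, f: impl FnOnce(&mut [T]) -> R) -> R {
    let mut scratch: Vec<T> = items.collect();
    f(&mut scratch)
    // `scratch` is dropped here; any borrow of it must already have ended.
}

fn main() {
    let total = buffer([3, 1, 2].into_iter(), |slice| {
        slice.sort_unstable();
        // Return an owned value, not the borrowed slice.
        slice.iter().sum::<i32>()
    });
    assert_eq!(total, 6);
}
```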
Ready for a review
Closing for now, we'll restart this if needed.