Ultra-slow first request performance
Describe the bug We have been using EdgeDB in production for a while, and a few months ago updated our NodeJS version from v16 to v18 and more on multiple projects. We have been experiencing ultra-slow first request performances, every time our EdgeDB client is trying to execute a request after a while.
Reproduction As this is a performance issue, I guess it might be kinda hard to replicate. We have a small DB with 56 types and 16380 objects, and this way of running requests.
import { createClient } from '~/edgedb';
import * as EdgeDB from '.../edgedb.ts';
// Single use client
request.run(createClient());
// Or using a shared client (imported from 'edgedb.ts')
request.run(EdgeDB.client);
Expected behavior Initial request should not take up to 50x more time to run than following requests.
Versions
- OS: MacOS / Linux
- EdgeDB version: 4.5+641a8f3
- EdgeDB CLI version: 4.1.1+a774177
edgedb-jsversion: 1.4.1
We tried a lot of Node versions such as v18.16.1, v19.8.1, v20.8.0, v20.11.0, v21.6.0. We also tried Bun v.1.1.0 and got the same results.
Thanks for your help!
As an example of what's happening, here are the comparison of request time between Node versions, from Node v16 to the left to the newest versions down the line.
Everytime the server idles a bit, you can see this kind of metrics when a new request comes in.
How are you running the edgedb server? If you're using the edgedb cli tool to create the instance, then this is the expected behaviour. Since instances created with the edgedb cli are intended for local development, and we don't want to use all your system resources if you have a lot of instances, the server is configured to shutdown after being idle for some time, and will restart using socket activation on the next query. For production use you should setup the edgedb server using one of these guides: https://docs.edgedb.com/guides/deployment, or use edgedb cloud.
We are running EdgeDB servers on bare metal Ubuntu instances. The EdgeDB server did not change between the Node 16 and above transitions we did.
I also noticed that this happens on the EdgeDB UI app side of things enabled on the production server. Could this be linked to the complexity of the DB and therefore be more appropriate to talk about this in the EdgeDB repo?
@scotttrinh any update on the subject?
@blecorre-mnp
No new updates to share here, but I'll spend a little time ASAP to try to get a reproduction together. If you have time to attempt getting a minimal reproduction together, that would definitely speed up the process of narrowing down what change in Node caused this sudden shift in behavior.
No worries, I'll try to provide a minimal example. I tried a few things this morning, and I might have found something interesting for you to look into:
Using EdgeDB's Trust connection method on the server (here) seems to have improved client initial connection time by quite a lot, although it is still somewhat unstable. My guess is that from what I read in the source code, using SCRAM authentication, the client is supposed to HMAC multiple times (4096 ?) the password and it might be what's causing this issue.
I'll let you look into this whenever you can, thanks a lot for your time!
My guess is that from what I read in the source code, using SCRAM authentication, the client is supposed to HMAC multiple times (4096 ?) the password and it might be what's causing this issue.
Interesting, yeah let's dive in there. I'm still not sure why this is a Node regression, but maybe there is some pathological situation where the SCRAM authentication is being invoked more than it ought to (like after some amount of time has elapsed, and maybe something get's garbage collected?).
Yes, it might be something like that. I'll keep you updated on what I find, I hope we'll get to the bottom of this!
I am seeing something very similar, and looking at some profiles for some of my requests it is saying that around 40% of my request time is being spent in the HMAC function. Turning the connection method to Trust like @blecorre-mnp says seems to have improved my connection time as well. I am on node 20.2 and would not be able to downgrade to 16 to test.
Oh, one thing that just occured to me is that Node 18 includes support for globalThis.crypto (behind a flag) which we will use if available. Perhaps the change from 16 -> 18+ meant that we went from using node:crypto for HMAC to using globalThis.crypto and there is a noticeable degradation of performance between those implementations. @ElijahJohnson5 since you recently profiled this, can you confirm that it's spending the time in the browserCrypto.HMAC function?
Oof, yeah, it seems like crypto.subtle.sign is way slower the the equivalent code in node:crypto:
┌─────────┬─────────────────────────────────────────┬───────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼─────────────────────────────────────────┼───────────┼────────────────────┼──────────┼─────────┤
│ 0 │ 'node:crypto: createHmac' │ '402,187' │ 2486.4021204019537 │ '±5.53%' │ 402188 │
│ 1 │ 'globalThis.crypto: crypto.subtle.sign' │ '89,031' │ 11231.956656034428 │ '±0.22%' │ 89032 │
└─────────┴─────────────────────────────────────────┴───────────┴────────────────────┴──────────┴─────────┘
I think we can still make this way better by not building the key object on every iteration of the HMAC, but we also might just want to flip the logic to prefer node:crypto when available.
I took a CPU profile of an initial connection call, and here is what I got!
As you said @scotttrinh, i seems that despite running on my Macbook Pro M2 in 44.71ms, this is the longest step in the connection process?
EDIT: Didn't see it at first but every long transaction originates from the HMAC function!
@blecorre-mnp
The change in #1063 should make a pretty noticeable difference, but I think that I do want to swap the logic here, so I'll probably wait for that before cutting an actual release. Once that PR lands though, you can use a canary release to test it out.
It looks like in mine it is using the crypto.subtle.sign as the profile says that importKey is taking up most of the time, the module looks the same as @blecorre-mnp's profile for the HMAC internal/crypto/webcrypto
@blecorre-mnp @ElijahJohnson5
Just pushed #1063 and #1064 which should switch to the faster node:crypto implementation. Latest canary contains both of these changes:
You'll probably need to install both edgedb and @edgedb/generate because @edgedb/generate has a dependency on a change that's just in the canary release of edgedb at the moment.
If y'all have time lemme know if that resolves the issue.
It seems to be better for me now! Still trying to verify 100%
I'll test it right away. Thank you very much for you help @scotttrinh @ElijahJohnson5 !
@blecorre-mnp Thanks for your patience here. I should've remembered about 18 introducing the globalThis.crypto, but I am still very shocked at the 10x performance difference for HMAC here. 🤯
I can confirm that the issue appears to be resolved on our side using the latest canary. Yes that's kinda impressively bad for globalThis.crypto, they might have good reasons for doing this but still... 😅
v1.5.8 is out now which includes this change! Thanks everyone for their patience and help getting to the bottom of this head-scratcher!