atproto
atproto copied to clipboard
Agent Creation has a large CPU time
Not sure if this is a real issue, but I thought I'd open a discussion around how long creating an agent takes.
Creation of a BSkyAgent takes over 112ms on my Macbook M1.
console.time('start');
import { BskyAgent } from "@atproto/api";
const agent = new BskyAgent({ service: "https://bsky.social" });
console.timeEnd("start");
This code isn't hitting the network, it's simply the initialization of the class.
I ran into this issue trying to use this library on Cloudflare workers, which limits your requests to 10ms CPU time.
They also offer 200ms script startup, and 50ms CPU time on pro plans. Both would go over.
Deno deploy has the same 10ms free, 50ms paid limitations.
So, is there anything that can be done to decrease this time? Or is the answer to just use a service that allows for longer run times?
The initialization doesn’t take anything for me in Deno on M2.
What is anything? Have you measured or deployed to Deno Deploy?
Oh, right. I was wrong.
$ time deno eval "import mod from \"npm:@atproto/api\";new mod.BskyAgent({ service: \"https://bsky.social\" });"
deno eval 0.65s user 0.06s system 134% cpu 0.528 total
Now, I am not sure neither I think is my expertise, but I believe it might be related to the large number of imports it makes.
I thought so, but the shipped version is compiled and has only 1 import. I think it's due to all the encrypt / decrypt, but also not my area of expertise
This is a common problem for API wrappers regardless of language, but the initialization cost tends to be heavy because of the high-end implementation. It would be interesting to see if there is a design pattern that could significantly improve this cost, because I also develop API wrapper in Dart.
I can look into this this weekend
FWIW: I recorded a profile and it seems like almost the entirety of the CPU time is dedicated to the addLexicons
method where it's parsing the lexicon docs at runtime. IMO this should be done ahead of time because the nearly 300ms of CPU time just to initialize an API client is kind of ridiculous.
![Screenshot 2023-05-08 at 14 59 00](https://user-images.githubusercontent.com/1398555/236947822-c13e216a-7485-45e5-8ad6-5ff889a8efe4.png)
Investigated a little further and determined if you remove these lines it completely resolves the issue and doesn't break anything: https://github.com/bluesky-social/atproto/blob/41793b7ff9a47155c6c1bc598893748692033b87/packages/lexicon/src/lexicons.ts#L47-L61
I get that the validation provides some degree of safety but that should really be done at build time and not at runtime (especially not for each instantiation of a client).
in the xRPC package, there's an addLexicon method for the client and the server that uses for of which is pretty slow in comparison to for or even for each. but building off of what @vincentriemer said, zod has pretty bad performance at scale. I think ajv is better at that but obviously less well known than zod. I don't want to get rid of a ton of lines like that, might break something
This is somewhat related, I've run the profiler on https://staging.bsky.app
and on my machine there's a 2s period where the page is just blank, during that time over a second was spent in zod. Looking at the whole profile zod is by far the most expensive part.
oh, ok, I think I see what's happening here. zod is used to validate/parse large objects which it doesn't do well at. as a temp solution, we could use David Desmarais-Michaud's myzod to replace it. I will make a pr shortly and we can benchmark and see how long it takes
I did a little digging and as we suspected the issue is with zod, to be precise with how zod handles unions and failures.
Let's start with how zod handles unions here is the union code, as expected zod tries to match every possible union option and it returns on the first valid one. The problem arises when we don't return early(eg. no case status === 'valid'
) which seems to be quite common for lexicon unions. When the options loop finishes zod maps all the issues to ZodError.
The unfortunate ZodError is declared as ZodError extends Error
, extending Error is not fun, it adds a lot of overhead because node has to gather the stack trace and do some other magic.
So that's why agent creating is soooooo damn slow.
Let's run through some numbers, on #1015 I've added a benchmark for BskyAgent
creation.
I'll do a very scientific test of removing extends Error
from the ZodError class.
Case | Sample Size | Time |
---|---|---|
ZodError extends Error |
100 | ~12820ms |
ZodError |
100 | ~5988ms |
ZodError extends Error |
10 | ~615ms |
ZodError |
10 | ~1269ms |
As we can see the difference is huge, obviously removing random code from a library is not a solution, but this should help to highlight that the issue is nested deeply inside zod and until zod fixes it, which I don't think is possible since I think they want to preserve the stack trace and well you can't generate one lazily, the solution to this issue is to either replace zod entirely or to not check lexicons schema at runtime.
Edit: just wanted to add that skipping schema validation(basically commenting out this takes down the creation time of 10 bsky agents from 1200ms to 10ms
With minimal changes I've managed to cut the time by around 4 times #1018
fantastic work everyone on digging into this - I learned a lot here
I did a little digging and as we suspected the issue is with zod, to be precise with how zod handles unions and failures.
Let's start with how zod handles unions here is the union code, as expected zod tries to match every possible union option and it returns on the first valid one. The problem arises when we don't return early(eg. no case
status === 'valid'
) which seems to be quite common for lexicon unions. When the options loop finishes zod maps all the issues to ZodError.The unfortunate ZodError is declared as
ZodError extends Error
, extending Error is not fun, it adds a lot of overhead because node has to gather the stack trace and do some other magic.So that's why agent creating is soooooo damn slow.
Let's run through some numbers, on #1015 I've added a benchmark for
BskyAgent
creation. I'll do a very scientific test of removingextends Error
from the ZodError class.Case Sample Size Time
ZodError extends Error
100 ~12820msZodError
100 ~5988msZodError extends Error
10 ~615msZodError
10 ~1269ms As we can see the difference is huge, obviously removing random code from a library is not a solution, but this should help to highlight that the issue is nested deeply inside zod and until zod fixes it, which I don't think is possible since I think they want to preserve the stack trace and well you can't generate one lazily, the solution to this issue is to either replace zod entirely or to not check lexicons schema at runtime.Edit: just wanted to add that skipping schema validation(basically commenting out this takes down the creation time of 10 bsky agents from 1200ms to 10ms
Has this been reported to Zod maintainers?
@VirtuaBoza No, as said I don't think anything major can be done(although this I think somewhat addresses the issue, also #1021). Overall most of the TS community seems to agree that zod isn't the most performant runtime type checking library, but I'm yet to find one that is as featureful and easy to use
We also just landed #1021, which should help. Anyone interested in doing a quick bench from main
to see where we're at?
@devinivy just rebased #1015 onto main
PASS API bench/agent.bench.ts
Agent Benchmarks
✓ Creates new Agent instance 10 times (342 ms)
~34ms per instance, sounds like an improvement
Closing this because we've made an order of magnitude improvement with everyone's help, and I don't think there are additional actions to take at the moment. But we'll be keeping an eye on perf around zod/validation and totally appreciate additional insights! All good if discussion continues here.