atproto icon indicating copy to clipboard operation
atproto copied to clipboard

Agent Creation has a large CPU time

Open wesbos opened this issue 1 year ago • 15 comments

Not sure if this is a real issue, but I thought I'd open a discussion around how long creating an agent takes.

Creation of a BSkyAgent takes over 112ms on my Macbook M1.

console.time('start');
import { BskyAgent } from "@atproto/api";
const agent = new BskyAgent({ service: "https://bsky.social" });
console.timeEnd("start");

This code isn't hitting the network, it's simply the initialization of the class.

I ran into this issue trying to use this library on Cloudflare workers, which limits your requests to 10ms CPU time.

They also offer 200ms script startup, and 50ms CPU time on pro plans. Both would go over.

Deno deploy has the same 10ms free, 50ms paid limitations.

So, is there anything that can be done to decrease this time? Or is the answer to just use a service that allows for longer run times?

wesbos avatar Apr 27 '23 16:04 wesbos

The initialization doesn’t take anything for me in Deno on M2.

roj1512 avatar Apr 27 '23 18:04 roj1512

What is anything? Have you measured or deployed to Deno Deploy?

wesbos avatar Apr 27 '23 20:04 wesbos

Oh, right. I was wrong.

 $ time deno eval "import mod from \"npm:@atproto/api\";new mod.BskyAgent({ service: \"https://bsky.social\" });"
deno eval   0.65s user 0.06s system 134% cpu 0.528 total

roj1512 avatar Apr 27 '23 21:04 roj1512

Now, I am not sure neither I think is my expertise, but I believe it might be related to the large number of imports it makes.

roj1512 avatar Apr 27 '23 21:04 roj1512

I thought so, but the shipped version is compiled and has only 1 import. I think it's due to all the encrypt / decrypt, but also not my area of expertise

wesbos avatar Apr 27 '23 21:04 wesbos

This is a common problem for API wrappers regardless of language, but the initialization cost tends to be heavy because of the high-end implementation. It would be interesting to see if there is a design pattern that could significantly improve this cost, because I also develop API wrapper in Dart.

myConsciousness avatar Apr 27 '23 23:04 myConsciousness

I can look into this this weekend

Evalprime avatar May 05 '23 21:05 Evalprime

FWIW: I recorded a profile and it seems like almost the entirety of the CPU time is dedicated to the addLexicons method where it's parsing the lexicon docs at runtime. IMO this should be done ahead of time because the nearly 300ms of CPU time just to initialize an API client is kind of ridiculous.

Screenshot 2023-05-08 at 14 59 00

vincentriemer avatar May 08 '23 22:05 vincentriemer

Investigated a little further and determined if you remove these lines it completely resolves the issue and doesn't break anything: https://github.com/bluesky-social/atproto/blob/41793b7ff9a47155c6c1bc598893748692033b87/packages/lexicon/src/lexicons.ts#L47-L61

I get that the validation provides some degree of safety but that should really be done at build time and not at runtime (especially not for each instantiation of a client).

vincentriemer avatar May 08 '23 23:05 vincentriemer

in the xRPC package, there's an addLexicon method for the client and the server that uses for of which is pretty slow in comparison to for or even for each. but building off of what @vincentriemer said, zod has pretty bad performance at scale. I think ajv is better at that but obviously less well known than zod. I don't want to get rid of a ton of lines like that, might break something

Evalprime avatar May 09 '23 15:05 Evalprime

This is somewhat related, I've run the profiler on https://staging.bsky.app and on my machine there's a 2s period where the page is just blank, during that time over a second was spent in zod. Looking at the whole profile zod is by far the most expensive part.

image

Matrix89 avatar May 09 '23 18:05 Matrix89

oh, ok, I think I see what's happening here. zod is used to validate/parse large objects which it doesn't do well at. as a temp solution, we could use David Desmarais-Michaud's myzod to replace it. I will make a pr shortly and we can benchmark and see how long it takes

Evalprime avatar May 09 '23 20:05 Evalprime

I did a little digging and as we suspected the issue is with zod, to be precise with how zod handles unions and failures.

Let's start with how zod handles unions here is the union code, as expected zod tries to match every possible union option and it returns on the first valid one. The problem arises when we don't return early(eg. no case status === 'valid') which seems to be quite common for lexicon unions. When the options loop finishes zod maps all the issues to ZodError.

The unfortunate ZodError is declared as ZodError extends Error, extending Error is not fun, it adds a lot of overhead because node has to gather the stack trace and do some other magic.

So that's why agent creating is soooooo damn slow.

Let's run through some numbers, on #1015 I've added a benchmark for BskyAgent creation. I'll do a very scientific test of removing extends Error from the ZodError class.

Case Sample Size Time
ZodError extends Error 100 ~12820ms
ZodError 100 ~5988ms
ZodError extends Error 10 ~615ms
ZodError 10 ~1269ms

As we can see the difference is huge, obviously removing random code from a library is not a solution, but this should help to highlight that the issue is nested deeply inside zod and until zod fixes it, which I don't think is possible since I think they want to preserve the stack trace and well you can't generate one lazily, the solution to this issue is to either replace zod entirely or to not check lexicons schema at runtime.

Edit: just wanted to add that skipping schema validation(basically commenting out this takes down the creation time of 10 bsky agents from 1200ms to 10ms

Matrix89 avatar May 09 '23 22:05 Matrix89

With minimal changes I've managed to cut the time by around 4 times #1018

Matrix89 avatar May 09 '23 23:05 Matrix89

fantastic work everyone on digging into this - I learned a lot here

wesbos avatar May 10 '23 14:05 wesbos

I did a little digging and as we suspected the issue is with zod, to be precise with how zod handles unions and failures.

Let's start with how zod handles unions here is the union code, as expected zod tries to match every possible union option and it returns on the first valid one. The problem arises when we don't return early(eg. no case status === 'valid') which seems to be quite common for lexicon unions. When the options loop finishes zod maps all the issues to ZodError.

The unfortunate ZodError is declared as ZodError extends Error, extending Error is not fun, it adds a lot of overhead because node has to gather the stack trace and do some other magic.

So that's why agent creating is soooooo damn slow.

Let's run through some numbers, on #1015 I've added a benchmark for BskyAgent creation. I'll do a very scientific test of removing extends Error from the ZodError class.

Case Sample Size Time ZodError extends Error 100 ~12820ms ZodError 100 ~5988ms ZodError extends Error 10 ~615ms ZodError 10 ~1269ms As we can see the difference is huge, obviously removing random code from a library is not a solution, but this should help to highlight that the issue is nested deeply inside zod and until zod fixes it, which I don't think is possible since I think they want to preserve the stack trace and well you can't generate one lazily, the solution to this issue is to either replace zod entirely or to not check lexicons schema at runtime.

Edit: just wanted to add that skipping schema validation(basically commenting out this takes down the creation time of 10 bsky agents from 1200ms to 10ms

Has this been reported to Zod maintainers?

VirtuaBoza avatar May 10 '23 22:05 VirtuaBoza

@VirtuaBoza No, as said I don't think anything major can be done(although this I think somewhat addresses the issue, also #1021). Overall most of the TS community seems to agree that zod isn't the most performant runtime type checking library, but I'm yet to find one that is as featureful and easy to use

Matrix89 avatar May 10 '23 22:05 Matrix89

We also just landed #1021, which should help. Anyone interested in doing a quick bench from main to see where we're at?

devinivy avatar May 11 '23 14:05 devinivy

@devinivy just rebased #1015 onto main

 PASS   API  bench/agent.bench.ts
  Agent Benchmarks
    ✓ Creates new Agent instance 10 times (342 ms)

~34ms per instance, sounds like an improvement

Matrix89 avatar May 11 '23 15:05 Matrix89

Closing this because we've made an order of magnitude improvement with everyone's help, and I don't think there are additional actions to take at the moment. But we'll be keeping an eye on perf around zod/validation and totally appreciate additional insights! All good if discussion continues here.

devinivy avatar May 11 '23 15:05 devinivy