umbriel
Performance issues
I sent an email to 18k contacts and noticed some CPU and memory problems.
Summary
- The HTTP service hit high CPU usage while handling AWS SNS notifications;
- The queue service hit high CPU usage while processing the queue to send the emails;
- The HTTP service hit high memory usage when adding the batch of jobs to the queue;
Errors
When receiving some AWS SNS events, the HTTP service was logging errors with this trace:
```
PrismaClientKnownRequestError2 [PrismaClientKnownRequestError]: Unique constraint failed on the fields: (`message_id`,`contact_id`)
    at cb (/home/node/app/node_modules/@prisma/client/runtime/index.js:34800:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async PrismaRecipientsRepository.saveWithEvents (/home/node/app/dist/modules/broadcasting/repositories/prisma/PrismaRecipientsRepository.js:41:5)
    at async RegisterEvent.execute (/home/node/app/dist/modules/broadcasting/useCases/RegisterEvent/RegisterEvent.js:60:5)
    at async RegisterEventController.handle (/home/node/app/dist/modules/broadcasting/useCases/RegisterEvent/RegisterEventController.js:90:22)
    at async /home/node/app/dist/infra/http/adapters/ExpressRouteAdapter.js:15:26 {
  code: 'P2002',
  clientVersion: '2.28.0',
  meta: { target: [ 'message_id', 'contact_id' ] }
}
```
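For illustration only, one way this kind of P2002 could be treated as an already-processed notification instead of an unhandled error would be a small sketch like the one below (the wrapper name and approach are assumptions, not what umbriel actually does):

```ts
import { Prisma } from '@prisma/client';

// Hypothetical wrapper: treats a (message_id, contact_id) unique-constraint
// violation as a duplicate delivery instead of an unhandled error.
export async function ignoreDuplicateDelivery(save: () => Promise<void>): Promise<void> {
  try {
    await save();
  } catch (err) {
    if (
      err instanceof Prisma.PrismaClientKnownRequestError &&
      err.code === 'P2002'
    ) {
      return; // recipient/event already registered by an earlier delivery
    }
    throw err;
  }
}
```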
Remediations
Infrastructure
- [x] Split the HTTP service that receives SNS notifications into a new service (100m CPU, 144 MB RAM, 35% CPU autoscale threshold);
- [x] Increased the queue service CPU allocation to 400m CPU;
Application
- [ ] Do some benchmarks with the Bee-Queue module and different concurrency factors;
- [ ] Add jobs to the queue in batches instead of creating a huge jobs array;
- [x] Try to reduce the number of queries when receiving notifications;
Did some benchmarking around queue job insertion using BullMQ and got the following results:
- Inserting 20k jobs in a single batch on the queue leads to ~190 MB of RAM usage and takes around 956 ms;
- Inserting 20k jobs individually on the queue leads to ~13 MB of RAM usage and takes around 430 ms (this uses a lot of CPU, but maybe it's better to have high CPU usage than high memory usage);
Why is inserting in a batch more expensive than inserting one by one? I don't know, but it is.
My guess that the memory usage came from creating a big recipients array on the JavaScript side was wrong; the problem was inserting into the queue.
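For reference, a minimal sketch of how this comparison could be reproduced (queue name and Redis connection details are assumptions, not the actual umbriel setup):

```ts
import { Queue } from 'bullmq';

// Assumed local Redis and a hypothetical queue name, just for the benchmark.
const queue = new Queue('benchmark', { connection: { host: '127.0.0.1', port: 6379 } });

const jobs = Array.from({ length: 20_000 }, (_, i) => ({
  name: 'deliver',
  data: { contactId: i },
}));

function heapMb(): number {
  return process.memoryUsage().heapUsed / 1024 / 1024;
}

async function benchAddBulk(): Promise<void> {
  const before = heapMb();
  const start = Date.now();
  await queue.addBulk(jobs); // one call with all 20k jobs
  console.log(`addBulk: ${Date.now() - start} ms, ~${(heapMb() - before).toFixed(1)} MB heap delta`);
}

async function benchAddOneByOne(): Promise<void> {
  const before = heapMb();
  const start = Date.now();
  for (const job of jobs) {
    await queue.add(job.name, job.data); // one call per job
  }
  console.log(`add: ${Date.now() - start} ms, ~${(heapMb() - before).toFixed(1)} MB heap delta`);
}

benchAddBulk()
  .then(benchAddOneByOne)
  .then(() => queue.close());
```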
Did some benchmarks with Bee-Queue, but it didn't result in better performance; actually, it changed nothing.
Removed two queries when receiving notifications from Amazon SNS.
Now we don't need to check for recipient existence nor create the events separately from upserting the recipient.
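Roughly, the idea is a single upsert with a nested event create; a sketch of the shape (model and field names are guesses based on the error trace above, not the actual umbriel schema):

```ts
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Sketch only: one upsert replaces "find recipient -> create recipient -> create event".
async function saveRecipientWithEvent(messageId: string, contactId: string, type: string) {
  await prisma.recipient.upsert({
    where: {
      // compound unique on (message_id, contact_id), as seen in the P2002 error
      message_id_contact_id: { message_id: messageId, contact_id: contactId },
    },
    create: {
      message_id: messageId,
      contact_id: contactId,
      events: { create: { type } },
    },
    update: {
      events: { create: { type } },
    },
  });
}
```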
Split the SNS webhook HTTP service on c81033171578d5859383b897fed4b41007bbff76.
@diego3g
> Inserting 20k jobs individually on the queue leads to ~13 MB of RAM usage and takes around 430 ms (this uses a lot of CPU, but maybe it's better to have high CPU usage than high memory usage);
Touchy subject; memory is usually considerably cheaper than vCPU [0]. Some numbers on the CPU usage in both tests would be nice!
It's also easier to manage/scale/allocate memory than CPU. Besides all that, applications in Node.js tend to consume more CPU than memory. Since we [currently] work almost exclusively with this technology, depending on the usage of each resource, it may be worth sacrificing some RAM. (:
[0]

| Item | Price |
| --- | --- |
| vCPU | $0.033174 / vCPU hour |
| Memory | $0.004446 / GB hour |
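At these rates, a vCPU-hour costs roughly 7.5x as much as a GB-hour of memory ($0.033174 / $0.004446 ≈ 7.5).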
Hey @pellizzetti, I tried doing some CPU benchmarking but I couldn't manage to get actual CPU consumption numbers. Do you have any tips for me on that?
@diego3g
If you created new, isolated scripts for your benchmarking, you won't need to handle startup/setup time, so I guess a simple `time` would do it, e.g. `time node <script>`. Just a quick reminder: it would be nice to run it a few times to get the average CPU usage of each script.
If you want to take a deeper dive, take some time to look over `perf`/`perf_events` [0].
[0] https://www.brendangregg.com/perf.html
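If an in-process number also helps, a sketch like this can isolate just the measured section (the `insertJobs()` function is a placeholder for whatever code is under test):

```ts
// Placeholder for whatever is being benchmarked (e.g. adding jobs to the queue).
async function insertJobs(): Promise<void> {
  // ...
}

async function main(): Promise<void> {
  const startCpu = process.cpuUsage();
  const startWall = process.hrtime.bigint();

  await insertJobs();

  const cpu = process.cpuUsage(startCpu); // user/system CPU time in microseconds
  const wallMs = Number(process.hrtime.bigint() - startWall) / 1e6;

  console.log(`wall: ${wallMs.toFixed(0)} ms`);
  console.log(`cpu: user ${(cpu.user / 1000).toFixed(0)} ms, system ${(cpu.system / 1000).toFixed(0)} ms`);
}

main();
```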
thought this would be a fun quest, decided to jump in and I got your answer :)
> Why is inserting in a batch more expensive than inserting one by one? I don't know, but it is.
tl;dr: because of extraneous loops allocating different objects and a lot of strings (`JSON.stringify()`) before sending to redis; when inserting only one job you don't allocate all these objects, and you can save some resources with fewer iterations.
here is the full story:
- calling `addBulk()` in BullMQ triggers multiple loops over the whole data, initializing a lot of new objects. testing on my MacBook with 20k simple objects (`{ foo: 'bar' }`), this loop alone takes approximately ~150ms of pure CPU
- and it gets worse at line #207 (`const result = (await multi.exec()) as [null | Error, string][];`): this basically loops over all the 20k objects we just added, checking some flags and building a `scripts[]` array (https://github.com/luin/ioredis/blob/e5615da8786956df08a9b33b6cd4dd31e6eaa759/lib/pipeline.ts#L321-L345), then it loops again over all the 20k objects and sends them one by one to redis (https://github.com/luin/ioredis/blob/e5615da8786956df08a9b33b6cd4dd31e6eaa759/lib/pipeline.ts#L444-L446)
so, there is a bunch of loops over all the data (a simplified sketch follows the list):
- first, `addBulk()` itself contains a `.map()` (which loops over all 20k objects): https://github.com/taskforcesh/bullmq/blob/afdaac6b072c7af5973222cc7fb69f3f138f3b0b/src/classes/queue.ts#L211-L219
- then, `createBulk()` again calls a map over all the 20k objects: https://github.com/taskforcesh/bullmq/blob/3b9b79dbdd754ab66c3948e7e16380f2d5513262/src/classes/job.ts#L191-L194
- after the map, `createBulk()` does another loop over all the objects, this time adding the jobs to the redis client queue: https://github.com/taskforcesh/bullmq/blob/3b9b79dbdd754ab66c3948e7e16380f2d5513262/src/classes/job.ts#L198-L205
- `addJob()` itself is a little expensive if your objects are big, since it calls a `JSON.stringify()`: https://github.com/taskforcesh/bullmq/blob/3b9b79dbdd754ab66c3948e7e16380f2d5513262/src/classes/job.ts#L893
- then ioredis just loops again through all the data to build its `scripts[]`: https://github.com/luin/ioredis/blob/e5615da8786956df08a9b33b6cd4dd31e6eaa759/lib/pipeline.ts#L321-L345
- and loops again through all the data to send each command, one by one, to redis: https://github.com/luin/ioredis/blob/e5615da8786956df08a9b33b6cd4dd31e6eaa759/lib/pipeline.ts#L444-L446
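roughly, the shape of the work looks something like this (a simplified illustration of the repeated passes, not BullMQ's or ioredis' actual code):

```ts
// simplified illustration of the repeated passes/copies, not the real code
type JobInput = { name: string; data: Record<string, unknown> };

function addBulkSketch(jobs: JobInput[]): string[] {
  // pass 1: addBulk() maps the inputs into job descriptions (new array #1)
  const described = jobs.map((job) => ({ ...job, opts: {} }));

  // pass 2: createBulk() maps the descriptions into job instances (new array #2)
  const instances = described.map((job) => ({ ...job, id: undefined as string | undefined }));

  // pass 3: each job gets a JSON.stringify() and is pushed onto the
  // client-side command queue (new array #3, plus one big string per job)
  const commands = instances.map((job) => JSON.stringify(job.data));

  // passes 4 and 5 happen inside the pipeline: one loop to build scripts[],
  // another loop to write each command to redis, one by one
  return commands;
}
```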
in the end, you have a bunch of copies of the original array; every `.map()` allocates a new array, BullMQ maintains the full queue, and ioredis also maintains the full queue with all the objects.
that's why this `addBulk()` is memory expensive, and it's CPU expensive because it has a lot of loops over all the data, and every loop blocks the event loop and the CPU.
performance issues are tricky; you might think that using `time` will give you the correct time used by the process, or that using `cpuUsage()` will give you the right answers, but most of the time it's not that simple: you will have other processes on the machine, other tasks running in your node application, and so on.
to understand this kind of issue you must dig into the code, read and understand what folks are doing. `debugger` and `console.log` are your best friends here; external tools will only point at the symptom, they will rarely show you the issue. sometimes it's better to actually read the code and fully understand what's going on than to benchmark different modules, since they might be using the same strategy.
a base ground rule when working with big data in javascript:
- every loop blocks the CPU/event loop for N ms (depending on the complexity of the loop and the size of the array)
- every `.map`/`.reduce`/`.split`/`JSON.stringify`/anything that allocates a new object over this data will use more memory, sometimes doubling the usage or even going to 3x+; be careful and work on reducing the iterations over this data (a chunking sketch follows below)
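a common mitigation, assuming the data can be processed in chunks (e.g. one `addBulk()` per chunk), is something like this sketch:

```ts
// process a big array in chunks so no single loop blocks the event loop for
// the whole data set, and peak allocations stay proportional to the chunk size
async function processInChunks<T>(
  items: T[],
  chunkSize: number,
  handle: (chunk: T[]) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i += chunkSize) {
    await handle(items.slice(i, i + chunkSize)); // e.g. (chunk) => queue.addBulk(chunk)
    // yield to the event loop between chunks so pending I/O and timers can run
    await new Promise<void>((resolve) => setImmediate(resolve));
  }
}
```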
✌️