
Performance issues

Open diego3g opened this issue 3 years ago • 9 comments

I sent an email to 18k contacts and noticed some CPU and memory problems.

Summary

  • The HTTP service showed high CPU usage while handling AWS SNS notifications;
  • The queue service showed high CPU usage while processing the queue to send emails;
  • The HTTP service showed high memory usage when adding the batch of jobs to the queue;

Errors

When receiving some AWS events, the HTTP service was logging some errors with this trace:

PrismaClientKnownRequestError2 [PrismaClientKnownRequestError]: Unique constraint failed on the fields: (`message_id`,`contact_id`)
    at cb (/home/node/app/node_modules/@prisma/client/runtime/index.js:34800:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async PrismaRecipientsRepository.saveWithEvents (/home/node/app/dist/modules/broadcasting/repositories/prisma/PrismaRecipientsRepository.js:41:5)
    at async RegisterEvent.execute (/home/node/app/dist/modules/broadcasting/useCases/RegisterEvent/RegisterEvent.js:60:5)
    at async RegisterEventController.handle (/home/node/app/dist/modules/broadcasting/useCases/RegisterEvent/RegisterEventController.js:90:22)
    at async /home/node/app/dist/infra/http/adapters/ExpressRouteAdapter.js:15:26 {
  code: 'P2002',
  clientVersion: '2.28.0',
  meta: { target: [ 'message_id', 'contact_id' ] }
} 

Remediations

Infrastructure

  • [x] Split the HTTP service that receives SNS notifications into a new service (100m CPU, 144 MB RAM and a 35% CPU autoscale threshold);
  • [x] Increased the CPU allocation of the queue service to 400m CPU;

Application

  • [ ] Do some benchmarks with the Bee Queue module and different concurrency factors;
  • [ ] Add jobs to the queue in batches instead of creating a huge jobs array (see the sketch after this list);
  • [x] Try to reduce the number of queries during notification receiving;
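A minimal sketch of that batching idea, assuming a BullMQ queue (the queue name, payload shape and chunk size here are illustrative, not the actual umbriel code):

import { Queue } from 'bullmq';

// Assumed names for illustration only.
interface SendEmailJob {
  contactId: string;
  messageId: string;
}

const queue = new Queue('send-email');

// Instead of building one huge jobs array and handing it to addBulk() in a
// single call, enqueue in fixed-size chunks so peak memory stays bounded.
async function enqueueInChunks(jobs: SendEmailJob[], chunkSize = 1000): Promise<void> {
  for (let i = 0; i < jobs.length; i += chunkSize) {
    const chunk = jobs.slice(i, i + chunkSize);
    await queue.addBulk(chunk.map((data) => ({ name: 'send-email', data })));
  }
}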

diego3g avatar Aug 13 '21 17:08 diego3g

Did some benchmarking around queue job inserting using BullMQ and got the following results:

  • Inserting 20k jobs in a single batch on the queue leads to ~190 MB of RAM usage and takes around 956 ms;
  • Inserting 20k jobs individually on the queue leads to ~13 MB of RAM usage and takes around 430 ms (this uses more CPU, but maybe it's better to have higher CPU usage than higher memory usage);

Why is inserting in a batch more expensive than inserting one by one? I don't know, but it is.

My guess that the memory usage came from creating a big recipients array on the JavaScript side was wrong; the problem was inserting into the queue.
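For reference, the two insertion strategies compared above look roughly like this (a minimal sketch; the queue name and payload are made up, and process.memoryUsage() is only a coarse indicator of what the container actually sees):

import { Queue } from 'bullmq';

const queue = new Queue('benchmark');
const payloads = Array.from({ length: 20_000 }, (_, i) => ({ index: i }));

// Strategy 1: one addBulk() call with the whole batch.
async function insertInSingleBatch(): Promise<void> {
  await queue.addBulk(payloads.map((data) => ({ name: 'job', data })));
}

// Strategy 2: one add() call per job.
async function insertIndividually(): Promise<void> {
  for (const data of payloads) {
    await queue.add('job', data);
  }
}

async function measure(label: string, fn: () => Promise<void>): Promise<void> {
  const heapBefore = process.memoryUsage().heapUsed;
  const start = Date.now();
  await fn();
  const elapsedMs = Date.now() - start;
  const heapDeltaMb = (process.memoryUsage().heapUsed - heapBefore) / 1024 / 1024;
  console.log(`${label}: ${elapsedMs} ms, ~${heapDeltaMb.toFixed(1)} MB heap delta`);
}

Running measure('batch', insertInSingleBatch) and measure('individual', insertIndividually) against a local Redis should reproduce the general shape of the numbers above, even if the absolute values differ.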

diego3g avatar Aug 13 '21 21:08 diego3g

Did some benchmarks with Bee Queue, but it didn't result in better performance; in fact, it changed nothing.

diego3g avatar Aug 13 '21 22:08 diego3g

Removed two queries when receiving notifications from Amazon SNS.

Now we don't need to check for recipient existence or create the events separately from upserting the recipient.
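The shape of the change is roughly a single upsert with a nested event create, instead of separate existence-check/create queries (a sketch only; the model and field names below are assumptions, not necessarily umbriel's actual schema):

import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Assumed model/field names ("recipient", "events", "messageId_contactId") for illustration.
async function registerEvent(messageId: string, contactId: string, eventType: string): Promise<void> {
  // One round trip: create the recipient if it doesn't exist and append the
  // event via a nested write, avoiding the separate create that was hitting
  // the P2002 unique constraint on (message_id, contact_id) in the trace above.
  await prisma.recipient.upsert({
    where: { messageId_contactId: { messageId, contactId } },
    update: { events: { create: { type: eventType } } },
    create: {
      messageId,
      contactId,
      events: { create: { type: eventType } },
    },
  });
}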

diego3g avatar Aug 13 '21 23:08 diego3g

Split the SNS webhook HTTP service in commit c81033171578d5859383b897fed4b41007bbff76.

diego3g avatar Aug 13 '21 23:08 diego3g

@diego3g

Inserting 20k jobs individually on the queue leads to ~13 MB of RAM usage and takes around 430 ms (this uses more CPU, but maybe it's better to have higher CPU usage than higher memory usage);

Touchy subject; memory is usually considerably cheaper than vCPU [0]. Some numbers around the CPU usage in both tests would be nice!

It's also easier to manage/scale/allocate memory than CPU. Besides all that, applications in Node.js tend to consume more CPU than memory. Since we [currently] work almost exclusively with this technology, depending on the usage of each resource, it may be worth sacrificing some RAM. (:

[0]

Item     Price
vCPU     $0.033174 / vCPU hour
Memory   $0.004446 / GB hour

pellizzetti avatar Aug 16 '21 19:08 pellizzetti

Hey @pellizzetti, I tried doing some CPU benchmarking, but I couldn't manage to get actual CPU consumption numbers. Do you have any tips for me on that?

diego3g avatar Aug 16 '21 19:08 diego3g

@diego3g

If you created new, isolated scripts for your benchmarking, you won't need to account for startup/setup time, so I guess a simple time would do it, e.g. time node <script>. Just a quick reminder: it would be nice to run it a few times to get the average CPU usage of each script.
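If you want a number from inside the process as well, Node's built-in process.cpuUsage() gives a rough per-run figure (a sketch; it only measures this process, and runBenchmark is a placeholder for the script's actual work):

async function runBenchmark(): Promise<void> {
  // placeholder for the code under test
}

async function main(): Promise<void> {
  const cpuStart = process.cpuUsage();
  const wallStart = process.hrtime.bigint();

  await runBenchmark();

  const { user, system } = process.cpuUsage(cpuStart); // microseconds since cpuStart
  const wallMs = Number(process.hrtime.bigint() - wallStart) / 1e6;

  console.log(`wall: ${wallMs.toFixed(1)} ms`);
  console.log(`cpu user: ${(user / 1000).toFixed(1)} ms, cpu system: ${(system / 1000).toFixed(1)} ms`);
}

main();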

pellizzetti avatar Aug 16 '21 20:08 pellizzetti

If you want to take a deeper dive, take some time to look over perf/perf_events[0].

[0] https://www.brendangregg.com/perf.html

pellizzetti avatar Aug 16 '21 20:08 pellizzetti

Thought this would be a fun quest, decided to jump in, and I got your answer :)

Why is inserting in a batch more expensive than inserting one by one? I don't know, but it is.

tl;dr: because of extraneous loops allocating different objects and a lot of strings (JSON.stringify()) before sending to Redis; when inserting only one job you don't allocate all these objects, and you save some resources with fewer iterations.

here is the full story:

  • calling addBulk() in BullMQ triggers multiple loops over the whole dataset, initializing a lot of new objects. Testing on my MacBook with 20k simple objects ({ foo: 'bar' }), this loop alone takes approximately ~150 ms of pure CPU

  • and it gets worse at line #207

const result = (await multi.exec()) as [null | Error, string][];

this basically loops over all the 20k objects that we just added, checking some flags and building a scripts[] array

  • https://github.com/luin/ioredis/blob/e5615da8786956df08a9b33b6cd4dd31e6eaa759/lib/pipeline.ts#L321-L345

then it loops again over all the 20k objects and sends them one by one to Redis

  • https://github.com/luin/ioredis/blob/e5615da8786956df08a9b33b6cd4dd31e6eaa759/lib/pipeline.ts#L444-L446

so, there are a bunch of loops over all the data:

  • first, addBulk() itself contains a .map() (which loops over all 20k objects)
  • https://github.com/taskforcesh/bullmq/blob/afdaac6b072c7af5973222cc7fb69f3f138f3b0b/src/classes/queue.ts#L211-L219
  • then, createBulk() again calls a .map() over all the 20k objects:
  • https://github.com/taskforcesh/bullmq/blob/3b9b79dbdd754ab66c3948e7e16380f2d5513262/src/classes/job.ts#L191-L194
  • after the map, createBulk() does another loop over all the objects, this time adding the jobs to the Redis client queue:
  • https://github.com/taskforcesh/bullmq/blob/3b9b79dbdd754ab66c3948e7e16380f2d5513262/src/classes/job.ts#L198-L205
  • addJob() itself is a little expensive if your objects are big, since it calls JSON.stringify():
  • https://github.com/taskforcesh/bullmq/blob/3b9b79dbdd754ab66c3948e7e16380f2d5513262/src/classes/job.ts#L893
  • then ioredis loops again through all the data to build its scripts[] array
  • https://github.com/luin/ioredis/blob/e5615da8786956df08a9b33b6cd4dd31e6eaa759/lib/pipeline.ts#L321-L345
  • and loops again through all the data to send each command, one by one, to Redis
  • https://github.com/luin/ioredis/blob/e5615da8786956df08a9b33b6cd4dd31e6eaa759/lib/pipeline.ts#L444-L446

in the end, you have a bunch of copies of the original array: every .map() allocates a new array, BullMQ keeps the full list of jobs, and ioredis also keeps the full command queue with all the objects. That's why addBulk() is memory expensive, and it's CPU expensive because it runs a lot of loops over all the data, and every loop blocks the event loop and the CPU.
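A tiny reproduction of the cost described above, simulating the repeated passes and the JSON.stringify() allocations over 20k items (a sketch; exact numbers depend on the machine and payload size):

const items = Array.from({ length: 20_000 }, () => ({ foo: 'bar' }));

const start = process.hrtime.bigint();

// Each pass allocates a new 20k-element array (and the last one, 20k strings),
// roughly mimicking the map/map/stringify passes in the addBulk() path.
const wrapped = items.map((data) => ({ name: 'job', data }));
const withOpts = wrapped.map((job) => ({ ...job, opts: {} }));
const serialized = withOpts.map((job) => JSON.stringify(job.data));

const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`${serialized.length} items, ${elapsedMs.toFixed(1)} ms for three passes`);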


performance issues are tricky; you might think that using time will give you the exact time used by the process, or that cpuUsage() will give you the right answers, but most of the time it's not that simple: there will be other processes on the machine, other tasks running in your Node application, and so on.

to understand this kind of issue you must dig into the code and read and understand what folks are doing; the debugger and console.log are your best friends here. External tools will only point at the symptom; they will rarely show you the issue. Sometimes it's better to actually read the code and fully understand what's going on than to benchmark different modules, since they might be using the same strategy.

some basic ground rules when working with big data in JavaScript:

  • every loop blocks the CPU/event loop for N ms (depending on the complexity of the loop and the size of the array)
  • every .map/reduce/split/JSON.stringify/anything that allocates a new object from this data will use more memory, sometimes doubling the usage or even going to 3x+; be careful and work on reducing the number of iterations over this data (see the sketch below).
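One common way to keep such loops from monopolizing the event loop is to walk the data in chunks and yield between them (a sketch; the chunk size is arbitrary and the right value depends on how heavy each item is):

import { setImmediate as yieldToEventLoop } from 'node:timers/promises';

// Process a big array in chunks, giving the event loop a chance to handle
// other work (HTTP requests, timers, queue callbacks) between chunks.
async function processInChunks<T>(items: T[], chunkSize: number, handle: (item: T) => void): Promise<void> {
  for (let i = 0; i < items.length; i += chunkSize) {
    const end = Math.min(i + chunkSize, items.length);
    for (let j = i; j < end; j++) {
      handle(items[j]); // no intermediate arrays are allocated here
    }
    await yieldToEventLoop(); // give other tasks a turn before the next chunk
  }
}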

✌️

Deividy avatar Jan 25 '22 09:01 Deividy