Audit Logs should be able to handle pressure
For efficiency, the logs are sent on a background thread.
There can be a problem if the logs are recorded faster than the background thread can process them, especially since currently they are sent one by one.
On my machine, the round-trip for sending a log message to the cloud usually takes around 100 ms but can peak at as much as 2 s per message. This means that in good conditions the maximum throughput is around 10 log events per second. This seems okay-ish for our current use cases, but it can easily be overwhelmed if a big operation runs a lot of queries quickly. If the operation runs only for a short time, the pending logs will be queued and sent with some delay. The problem starts if the operation keeps running for a longer time at too high a throughput: more and more log messages will be queued and the system may be unable to keep up.
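For illustration, here is a minimal Java sketch of this kind of design, assuming an unbounded queue drained by a single background thread that sends one message per round-trip. The class and method names are hypothetical, not the actual implementation:

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the current design: an unbounded queue drained by a
// single background thread that sends one message per round-trip, so throughput
// is capped at roughly 1/latency (~10 messages/s at ~100 ms per request).
final class SingleMessageLogSender {
    private final LinkedBlockingQueue<String> pending = new LinkedBlockingQueue<>();

    SingleMessageLogSender() {
        Thread worker = new Thread(this::runLoop, "audit-log-sender");
        worker.setDaemon(true);
        worker.start();
    }

    /** Called from user code; never blocks, because the queue is unbounded. */
    void logAsync(String message) {
        pending.add(message);
    }

    private void runLoop() {
        while (true) {
            try {
                String message = pending.take();
                sendToCloud(message); // one HTTP round-trip (~100 ms) per message
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void sendToCloud(String message) {
        // Placeholder for the HTTP POST to the Cloud logs endpoint.
    }
}
```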
We have two values that we need to weigh:
- efficiency
- reliability of log events
The messages are sent in the background for efficiency, but if the system is overwhelmed, that may lead to problems with reliability.
Solutions to consider:
- [ ] block the main thread once too many log messages are pending
- This would allow the background thread to 'unclog' by processing the pending messages, ensuring that not too many wait in the queue at once.
- IMO this is a reasonable compromise between efficiency and reliability - in usual scenarios the log messages are still sent asynchronously and do not slow down execution at all, but if the throughput is exceeded, they start blocking to ensure the background thread has a chance to keep up (see the sketch below this list).
- [ ] batch sent messages
- Currently the messages are sent one-by-one. This is not very efficient, as the ~100 ms is mostly round-trip latency. Increasing the payload a bit - e.g. by sending ~10-100 messages in one batch - could result in similar latency while achieving a much larger throughput.
- We would still send the messages as they come, with no delay. But if, after processing the current message, more than one message is already pending (the messages were enqueued faster than we processed them), we could send multiple messages in a single batch for more efficient processing.
- This would require changes in the Cloud to allow the `/logs` endpoint to accept multiple messages in a single request.
Setting as low priority because, while this is a problem, current workloads are unlikely to encounter it often.
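A minimal sketch of the blocking idea from the first option above, assuming a hypothetical bounded queue; the class name and the limit are illustrative, not the actual implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical sketch of backpressure via a bounded queue: enqueueing is free
// until MAX_PENDING messages are waiting, after which the caller blocks until
// the background thread catches up.
final class BackpressureLogSender {
    private static final int MAX_PENDING = 1000; // illustrative threshold
    private final ArrayBlockingQueue<String> pending = new ArrayBlockingQueue<>(MAX_PENDING);

    /** Usually returns immediately; blocks only when the queue is full. */
    void logAsync(String message) throws InterruptedException {
        pending.put(message); // blocks while MAX_PENDING messages are already queued
    }

    // The background thread consumes `pending` exactly as in the current design.
}
```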
> This would require changes in the Cloud to allow the `/logs` endpoint to accept multiple messages in a single request.
As reported by @PabloBuchu, the Cloud endpoint for logs can accept a list `{logs: [...]}` - so batching is now supported.
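Given that the endpoint accepts a list, the batching loop could look roughly like the sketch below: wait for one message, drain whatever else has queued up in the meantime, and send it all in one request. The names, batch limit, and payload construction are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the batching background loop: take one message, then
// drain anything else that queued up in the meantime and send it as one request.
final class BatchingLogSender {
    private static final int MAX_BATCH = 100; // illustrative upper bound per request
    private final LinkedBlockingQueue<String> pending = new LinkedBlockingQueue<>();

    private void runLoop() throws InterruptedException {
        while (true) {
            List<String> batch = new ArrayList<>();
            batch.add(pending.take());             // wait for at least one message
            pending.drainTo(batch, MAX_BATCH - 1); // grab whatever else is already queued
            sendBatchToCloud(batch);               // one round-trip for the whole batch
        }
    }

    private void sendBatchToCloud(List<String> batch) {
        // Placeholder: POST a payload of the form {logs: [...]} to the /logs endpoint.
    }
}
```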
Radosław Waśko reports a new STANDUP for yesterday (2024-08-27):
Progress: Work on audit log batching - rewriting the background thread to keep a queue. It should be finished by 2024-08-29.
Next Day: Next day I will be working on the same task. Continue
Radosław Waśko reports a new STANDUP for yesterday (2024-08-28):
Progress: Implemented batching. Working on shutting down the thread if not used. It should be finished by 2024-08-29.
Next Day: Next day I will be working on the same task. Ensure liveness of logging. Debug and fix issue with logs on real cloud - getting 400 HTTP errors.
Radosław Waśko reports a new STANDUP for yesterday (2024-08-29):
Progress: CR, some improvements to the audit log batch PR. Start work on Cloud tests on CI. It should be finished by 2024-08-29.
Next Day: Next day I will be working on the #9523 task. Try implementing a prototype
I've implemented the batching part of this task with #10918.
Moving back to New so that we can schedule the second part - handling pressure by blocking further async log requests when the queue exceeds some size. This has lower priority, so we can schedule it whenever we have some time, or only if we see practical performance problems. Right now logging is used with such low intensity that we are very unlikely to run into problems from not having this pressure mechanism implemented, but it will be good to have it at some point in the future.