opentelemetry-collector
Documentation improvement: tuning the collector for stability & performance
Problem
I've just spent a week tuning our deployed otel collectors to improve stability and performance. I've not seen any discussion around tuning best practices, and I think the project could benefit from a document that outlines some recommended parameters.
Solution
Parameters I have in mind discussing include:
- Setting `max_recv_msg_size_mib` for the OTLP receiver to the batch size most clients are using (should be 512 by default according to the docs) multiplied by the expected maximum size of a message. For example, given a max expected message size of 100 KiB, we'd end up with `512 * 100 KiB = 51,200 KiB`, which is 50 MiB exactly, or a recommended value of 52 MiB if you divide by 1,000 and round up for a little headroom. Without this, it's easy for the collector to drop data when receiving large batches (max message size is only 4 MiB by default).
- For clients with large maximum messages, setting `send_batch_max_size` in the batch processor to be big enough to match what clients are sending; otherwise they'll see excessive splitting of batches. This seems to be very memory intensive: I observed a 5x memory utilization drop by changing `send_batch_max_size: 250` to `send_batch_max_size: 512`.
- Enabling advanced compression (e.g. zstd) for collector-to-collector transmission, as otel collectors all support compression algos beyond gzip.
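To make the three suggestions concrete, here is a rough sketch of how they might look together in one collector config. It is illustrative, not a recommended baseline. The endpoints (including the hypothetical `downstream-collector.example` host), the choice of a traces pipeline, and the `send_batch_size` value are assumptions added for the example. The 52 and 512 values come from the examples above.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        # 512 items per client batch * ~100 KiB per item, rounded up as in
        # the example above; the gRPC default is only 4 MiB.
        max_recv_msg_size_mib: 52

processors:
  batch:
    # Assumption: send_batch_size is lowered to match, because the batch
    # processor requires send_batch_max_size >= send_batch_size.
    send_batch_size: 512
    # Matches what clients send, so incoming batches are not split.
    send_batch_max_size: 512

exporters:
  otlp:
    # Hypothetical address of the next collector in the chain.
    endpoint: downstream-collector.example:4317
    # zstd instead of the default gzip for collector-to-collector traffic.
    compression: zstd

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The right numbers depend on the client batch size and payload sizes actually observed in a given deployment, so treat these values as a starting point to measure against.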