producer/consumer metrics using redpanda_kafka_request_bytes_total are wrong
This metric miscalculates usage while a learner event is in progress (decommission, node add, etc.). We should instead use the RPC byte counters below to determine on-the-wire produce and consume throughput for the cluster.
The queries should be updated to:
Producer traffic:
sum(rate(redpanda_rpc_received_bytes{redpanda_server="kafka", redpanda_id="$redpanda_id"}[5m])) by (cluster)
Consumer traffic:
sum(rate(redpanda_rpc_sent_bytes{redpanda_server="kafka", redpanda_id="$redpanda_id"}[5m])) by (cluster)
Adjust the labels accordingly to fit the observability repo dashboards.
@hcoyote - does this fix the sharechat issue where they see replication traffic on the consumer side?
Neither redpanda_rpc_received_bytes nor redpanda_rpc_sent_bytes includes topic-level detail. In contrast, redpanda_kafka_request_bytes_total does provide topic-level detail (using the redpanda_topic label).
Whether or not that matters is down to the use case. In this example, I wouldn't say we can use one in place of the other.
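For reference, a per-topic breakdown that the RPC counters cannot provide might look like the sketch below. It assumes redpanda_kafka_request_bytes_total carries a redpanda_request label with values such as "produce" and "consume" in addition to redpanda_topic, and it reuses the redpanda_id selector from the queries above. Check the label names against your scrape output before relying on it.

```promql
# Per-topic produce throughput (assumes a redpanda_request="produce" label value)
sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="produce", redpanda_id="$redpanda_id"}[5m])) by (redpanda_topic)

# Per-topic consume throughput (assumes a redpanda_request="consume" label value)
sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="consume", redpanda_id="$redpanda_id"}[5m])) by (redpanda_topic)
```

These queries would still be subject to the learner-event miscalculation described at the top of this issue, so they only make sense for dashboards where topic-level detail matters more than exact on-the-wire totals.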