torrust-tracker icon indicating copy to clipboard operation
torrust-tracker copied to clipboard

Review how we use tracing levels

Open josecelano opened this issue 11 months ago • 0 comments

Relates to: https://github.com/torrust/torrust-tracker/issues/1150

I want to discuss how we use tracking levels, especially the ERROR level. I will explain 2 cases, but this could be applied everywhere in the software.

Case 1

Relates to: https://github.com/torrust/torrust-tracker/issues/1164

When a BitTorrent clients send a wrong connection ID to the UDP tracker, we write an error into the logs:

2024-12-26T16:36:47.110017Z ERROR UDP TRACKER: response error error=cookie value is expired: -0.00000000000000000004989914643359549, expected > 1735230886.1099956 remote_addr=127.0.0.1:34493 local_addr=127.0.0.1:34443 request_id=37145179-19a4-4dee-8de4-38bf78adb947 transaction_id=-1197696568

Case 2

Relates to: https://github.com/torrust/torrust-tracker/issues/1152

The tracker API returns a 500 in some cases, but that does not mean there was a server error. It's the expected API behavior. We want to fix that in the new major API version.

The loged line is like the following:

2024-12-23T16:17:24.085791Z ERROR API: response failed classification=Status code: 500 Internal Server Error latency=0 ms

That line is generated here:

                .on_response(move |response: &Response, latency: Duration, span: &Span| {
                    let latency_ms = latency.as_millis();
                    let status_code = response.status();
                    let request_id = response
                        .headers()
                        .get("x-request-id")
                        .map(|v| v.to_str().unwrap_or_default())
                        .unwrap_or_default();

                    span.record("request_id", request_id);

                    if status_code.is_server_error() {
                        tracing::event!(
                            target: API_LOG_TARGET,
                            tracing::Level::ERROR, %latency_ms, %status_code, %server_socket_addr, %request_id, "response");
                    } else {
                        tracing::event!(
                            target: API_LOG_TARGET,
                            tracing::Level::INFO, %latency_ms, %status_code, %server_socket_addr, %request_id, "response");
                    }
                })
                .on_failure(
                    move |failure_classification: ServerErrorsFailureClass, latency: Duration, _span: &Span| {
                        let latency = Latency::new(LatencyUnit::Millis, latency);

                        tracing::event!(
                            target: API_LOG_TARGET,
                            tracing::Level::ERROR, %failure_classification, %latency, %server_socket_addr, "response failed");
                    },
                ),

I think we should NOT use the ERROR logging level in these two cases because:

  • It's the expected application behaviour.
  • Or the user did not use the application correctly.

I would reserve the ERROR level for things the SysAdmin should investigate. Otherwise, we will add a lot of noise to the logs.

There could be errors if the reason is there was something wrong in the code even when the request was well-formed.

I would use the WARNING level only for the UDP case if we want to detect "bad" application use. However, I would do it via metrics instead of logs. For example, we might detect a non-normal number of "error with connection ID" responses and investigate it, but we first should assume it's just a harmful use of the application.

In conclusion, reserve "ERROR" levels only for unexpected things the SysAdmin should investigate.

What do you thing @da2ce7?

josecelano avatar Dec 26 '24 16:12 josecelano