torrust-tracker
torrust-tracker copied to clipboard
Review how we use tracing levels
Relates to: https://github.com/torrust/torrust-tracker/issues/1150
I want to discuss how we use tracking levels, especially the ERROR level. I will explain 2 cases, but this could be applied everywhere in the software.
Case 1
Relates to: https://github.com/torrust/torrust-tracker/issues/1164
When a BitTorrent clients send a wrong connection ID to the UDP tracker, we write an error into the logs:
2024-12-26T16:36:47.110017Z ERROR UDP TRACKER: response error error=cookie value is expired: -0.00000000000000000004989914643359549, expected > 1735230886.1099956 remote_addr=127.0.0.1:34493 local_addr=127.0.0.1:34443 request_id=37145179-19a4-4dee-8de4-38bf78adb947 transaction_id=-1197696568
Case 2
Relates to: https://github.com/torrust/torrust-tracker/issues/1152
The tracker API returns a 500 in some cases, but that does not mean there was a server error. It's the expected API behavior. We want to fix that in the new major API version.
The loged line is like the following:
2024-12-23T16:17:24.085791Z ERROR API: response failed classification=Status code: 500 Internal Server Error latency=0 ms
That line is generated here:
.on_response(move |response: &Response, latency: Duration, span: &Span| {
let latency_ms = latency.as_millis();
let status_code = response.status();
let request_id = response
.headers()
.get("x-request-id")
.map(|v| v.to_str().unwrap_or_default())
.unwrap_or_default();
span.record("request_id", request_id);
if status_code.is_server_error() {
tracing::event!(
target: API_LOG_TARGET,
tracing::Level::ERROR, %latency_ms, %status_code, %server_socket_addr, %request_id, "response");
} else {
tracing::event!(
target: API_LOG_TARGET,
tracing::Level::INFO, %latency_ms, %status_code, %server_socket_addr, %request_id, "response");
}
})
.on_failure(
move |failure_classification: ServerErrorsFailureClass, latency: Duration, _span: &Span| {
let latency = Latency::new(LatencyUnit::Millis, latency);
tracing::event!(
target: API_LOG_TARGET,
tracing::Level::ERROR, %failure_classification, %latency, %server_socket_addr, "response failed");
},
),
I think we should NOT use the ERROR logging level in these two cases because:
- It's the expected application behaviour.
- Or the user did not use the application correctly.
I would reserve the ERROR level for things the SysAdmin should investigate. Otherwise, we will add a lot of noise to the logs.
There could be errors if the reason is there was something wrong in the code even when the request was well-formed.
I would use the WARNING level only for the UDP case if we want to detect "bad" application use. However, I would do it via metrics instead of logs. For example, we might detect a non-normal number of "error with connection ID" responses and investigate it, but we first should assume it's just a harmful use of the application.
In conclusion, reserve "ERROR" levels only for unexpected things the SysAdmin should investigate.
What do you thing @da2ce7?