Loki logs are inconsistently returned when the time query is changed slightly
Yesterday, I upgraded my Loki installation to 3.0.0 via the helm chart. Since then, I have been having issues with some of my dashboards missing data. During my investigations of those issues, I found out that on some queries Loki does not return any data where it clearly has data to return.
This logcli query runs from 0:00 to 2:00 local time over the ingress-nginx logs of last night, some sensitive details snipped:
$ logcli query '{namespace="ingress-nginx"}' --from '2024-07-04T00:00:00+02:00' --to '2024-07-04T02:00:00+02:00' | grep robots
2024/07/04 11:05:55 http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&end=1720051200000000000&limit=30&query=%7Bnamespace%3D%22ingress-nginx%22%7D&start=1720044000000000000
2024/07/04 11:05:55 Common labels: {app="ingress-nginx", component="controller", container="controller", instance="ingress-nginx", job="ingress-nginx/ingress-nginx", level="info", namespace="ingress-nginx", service_name="ingress-nginx", stream="stdout"}
2024-07-04T01:59:03+02:00 {filename="/var/log/pods/ingress-nginx_ingress-nginx-controller-8447845fb9-gp9rg_60bf31fc-d1d2-467d-9e49-5203ce38108d/controller/0.log", node_name="[snip]", pod="ingress-nginx-controller-8447845fb9-gp9rg"} [snip] - - [03/Jul/2024:23:59:03 +0000] "GET /robots.txt HTTP/1.1" 200 248 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; [email protected])" 224 0.118 [snip] [] 100.64.1.181:80 248 0.119 200 0ea9a5c8c5c851e846ba7480ca774ebd
2024-07-04T01:59:02+02:00 {filename="/var/log/pods/ingress-nginx_ingress-nginx-controller-8447845fb9-gp9rg_60bf31fc-d1d2-467d-9e49-5203ce38108d/controller/0.log", node_name="[snip]", pod="ingress-nginx-controller-8447845fb9-gp9rg"} [snip] - - [03/Jul/2024:23:59:02 +0000] "GET /robots.txt HTTP/1.1" 308 164 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; [email protected])" 224 0.000 [snip] [] - - - - a4b11244e9598bd78c082f5b5b3b7480
$
Clearly, there are log lines at 01:59:02 and 01:59:03 local time. When I change my request to not end at 02:00:00 local time, but 01:59:59 local time, I would expect the same output, since the time range clearly still includes those two log lines.
$ logcli query '{namespace="ingress-nginx"}' --from '2024-07-04T00:00:00+02:00' --to '2024-07-04T01:59:59+02:00' | grep robots
2024/07/04 11:09:25 http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&end=1720051199000000000&limit=30&query=%7Bnamespace%3D%22ingress-nginx%22%7D&start=1720044000000000000
$
Instead, no results are generated. I see this on other namespaces and other types of queries as well - as far as I can tell there's nothing particular to this request. When I use 02:00:01 as an end date instead, results are returned normally. Changing the start date to various values does not seem to have any unexpected effects. The results are fully reproducible.
What could cause this?
This is logcli 3.0.0; localhost:3100 is a kubectl port-forward to the prometheus-loki-gateway service (port 80 on the service) and Loki is also version 3.0.0. I'm running with three backends, three read replicas, three write replicas and in-cluster minio. My config.yaml is config.yaml. Please let me know if there's anything else I could supply to help investigate this.
Anyone had a look to this ? We are having similar issues while trying to load +/- 266k lines of logs on a 1d duration with logcli, only a subset is returned (127296 lines)
Hi team, We did some additional testing on our side and discovered that the result cache was causing missing log records. If we disable the result cache, we have have the expected number of logs.
Note: for a query where we want to download all logs for a given log stream, we get consistent results, regardless of whether the result cache is enabled or not.
Note 2: For a given query, we use the metric query sum(count_over_time(<query>[1d])) to count the expected number of logs we would download.
HTH