
failed mapping AST context canceled

Open RobbanHoglund opened this issue 1 year ago • 42 comments

Describe the bug After upgrading to Loki 3.2.0, it repeatedly logs the following for some queries: ts=2024-10-04T01:09:28.649297614Z caller=spanlogger.go:111 middleware=QueryShard.astMapperware org_id=fake traceID=35d59f6204919421 user=fake caller=log.go:168 level=warn msg="failed mapping AST" err="context canceled" query="{application="myapp",level=~"INFO|WARN|ERROR"} "

To Reproduce Steps to reproduce the behavior:

  1. Started Loki 3.2.0
  2. From Grafana Explore, run {application="myapp",level=~"INFO|WARN|ERROR|TRACE"}

Expected behavior If this is an actual problem in Loki 3.2.0, it would be good to get more information about the root cause.

Environment:

  • Infrastructure: bare-metal
  • Deployment tool: bash scripts

Screenshots, Promtail config, or terminal output Two screenshots attached (see original issue).

RobbanHoglund avatar Oct 04 '24 06:10 RobbanHoglund

Also seeing this in a cluster running 3.2.0; however, it's only happening on one of our read pods (running SSD via Helm). The same pod is reporting high query latency (up to 50s) and a fair amount of "context canceled" in the logs. All the other read pods are behaving just fine and returning queries in sub-seconds.

jammiemil avatar Oct 05 '24 21:10 jammiemil

I ran into something similar with 3.1.1. Restarting the broken pod seems to have fixed it.

eplightning avatar Oct 21 '24 09:10 eplightning

Indeed, I ran into this issue in 3.2.0, and killing the pod and spinning up a new one fixed it.

yalattas avatar Oct 21 '24 14:10 yalattas

I'm having the same issue with Loki 3.2.0 deployed using Docker and SSD. Once in a while the read container returns msg="failed mapping AST" err="context canceled" errors, and the only way to resolve it is to restart the read container. I don't spot any issues in metrics...

vincentnonim avatar Oct 22 '24 07:10 vincentnonim

+1 here. When it happens, it seems more severe on one pod than the others. Running Loki 3.2.1.

Restarting the pod works for me as well.

JeffreyVdb avatar Oct 30 '24 06:10 JeffreyVdb

We are running monolithic mode and we have restarted the Loki processes several times but nothing seems to help.

RobbanHoglund avatar Oct 30 '24 09:10 RobbanHoglund

Same issue in my case, with Loki deployed in simple scalable mode. When I run the query, I keep seeing this error. Restarting or redeploying does not help.

Khagesh16 avatar Oct 31 '24 14:10 Khagesh16

OK, so I restarted the read pods; I still get the error, but no more EOF errors/timeouts from queries. A good way to detect whether the impact is noticeable from the client's perspective seems to be to check whether any 499 HTTP status codes are returned from the nginx gateway. There is probably a Loki metric somewhere that indicates it, but I could not find it (a sketch of one candidate follows this comment). Also, no idea yet how to work around the problem. I will try to look at the code to see where this error comes from and what we can do about it.

I did some digging, and my initial EOF/timeout problems are not linked to that warning. I tried removing a bunch of things from my config hoping to get rid of the warning, but could not find any way to remove it; removing my old store config did not make it disappear either. Looking at the code, it seems to happen when parsing and sharding the query.

I will sadly stop the investigation here, since my query problems do not seem to be linked to this warning. Still, it would be nice to have this warning disappear in the next release.
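
Regarding the 499 detection mentioned above: a minimal PromQL sketch, assuming the standard loki_request_duration_seconds_count metric (which carries a status_code label and drives the "Loki Operational" dashboard panels mentioned later in this thread); adjust the selectors to your deployment:

# Rate of requests ending in HTTP 499 (client closed request), split by route.
# Assumes Loki's built-in request instrumentation; label names may differ per setup.
sum by (route) (rate(loki_request_duration_seconds_count{status_code="499"}[5m]))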

+1, we also see this issue occasionally, and deleting the affected pod fixes it.

emadolsky avatar Nov 07 '24 11:11 emadolsky

We run loki 3.1.1 with deploymentMode: SimpleScalable and also saw the same error. Restarting the loki-read deployment fixed the issue in seconds.

BlexToGo avatar Nov 07 '24 14:11 BlexToGo

I'm experiencing the same issue. I have a few clusters running Loki 3.2.1 in Simple Scalable Mode (SSD). All clusters display the warning "failed mapping AST," though it doesn’t seem to impact Loki’s functionality, except in two specific clusters. In these cases, similar to previous reports, one of the Pods stops processing queries and only resumes normal operation after a restart.

This issue began occurring only after upgrading from version 2.9.X to 3.2.X, following the guide for upgrading to version 3.X.

As @someStrangerFromTheAbyss mentioned, there is a noticeable increase in 499 HTTP status codes, which can be seen on the "Loki Operational" Grafana Dashboard.

In my observations, the affected Pod still receives requests from Nginx (loki-gateway) but doesn’t return a response, appearing to get stuck while processing. My deployment relies on AWS S3.

marcotuna avatar Nov 09 '24 17:11 marcotuna

For the HTTP 499 code, I can confirm these are not linked. I was able to resolve the 499 issue separately from that error message. I wrote a really long answer on the 499 issue here: https://github.com/grafana/loki/issues/7084

TLDR: The 499 issue is linked to the boltdb-shipper indexer and the connection to the backend when using boltdb-shipper. If you can remove these, the 499s should disappear.

EDIT: After some additional digging, I'm no longer convinced of this. The issue has reappeared on my side. It happens less often, but it still does.

Happened to us as well. Any idea how to catch this with a livenessProbe or something similar? (A probe sketch follows this comment.) It fails quite silently and you will only notice once you use it (Schrödinger's Loki).

jan-kantert avatar Nov 22 '24 12:11 jan-kantert
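
On the livenessProbe question above, a minimal sketch against Loki's /ready endpoint on the default HTTP port 3100, assuming a Kubernetes Deployment for the read pods. Several reports in this thread suggest a stuck pod may still pass /ready, so treat this as a starting point rather than a confirmed fix:

# Hypothetical probe for the loki-read container; /ready is Loki's standard readiness endpoint,
# but it is not guaranteed to fail while a pod is in the "failed mapping AST" state.
livenessProbe:
  httpGet:
    path: /ready
    port: 3100
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 5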

Any update on this issue?

EZ4BRUCE avatar Nov 27 '24 03:11 EZ4BRUCE

We are observing this issue on Loki 3.3.0 deployed via Helm chart version 6.22.0 as well. We are running Loki in a SimpleScalable deployment. We see a lot of these logs when using the Explore Logs feature on Grafana v11.3.1 (64b556c137); we have all of the prerequisites configured in Loki and Grafana.

LeszekBlazewski avatar Nov 29 '24 12:11 LeszekBlazewski

After some more digging, it seems this error is indeed linked to the 499 errors, but not all the time. When a query gets a 504 on the Grafana side, it is most likely that this error log appears on the bad pod. The query does reach the pod and I can see the read pod processing it, but the last step in its query path is the "failed mapping AST" error log.

It seems to only happen to one of the read pods after a deployment, and most of the time to the first read pods that connect to the backend pod. I'm back to investigating the issue since it affects client queries.

I'm experiencing the same problem. I'm running Loki version 3.2.0, and at some moments I cannot load label filters in Grafana -> Explore: I get an infinite spinner in the browser's label filter field and "failed mapping AST" in the loki-read pods. A restart helps for a while, but the same situation reoccurs within 12 hours.

disparo13 avatar Nov 29 '24 17:11 disparo13

@someStrangerFromTheAbyss I can verify that it's a connectivity issue between reads and backends, and it happens as well when you just roll out the backend pods.

emadolsky avatar Dec 05 '24 11:12 emadolsky

We have this issue from time to time as well. No noticeable change after upgrading to Loki v3.3.0. A restart seems to help for a while, but then it reappears.

Starefossen avatar Dec 06 '24 08:12 Starefossen

Bump, stuck on this issue for months. Even a restart does not seem to help.

Khagesh16 avatar Dec 16 '24 14:12 Khagesh16

I am currently using the following components. When I use the Grafana Explore log feature and refresh, this error keeps occurring. Has this issue been resolved?

  • Promtail: 3.0.0
  • Loki: 3.3.2 (single binary)
  • Grafana: 11.3

garam42 avatar Jan 07 '25 00:01 garam42

We also see the "failed mapping AST" messages increase along with HTTP 499 in loki-read pods after setting bloom_gateway_enable_filtering: true. Loki queries in Grafana intermittently fail with error messages.

Note: we've had bloom_creation_enabled: true for a few days without seeing these issues.

Could the issue be related to bloom filters? Anyone else in this thread have this enabled?

Restarting the loki-read Deployment seems to have temporarily fixed it. Let's see if it comes back.

lindeskar avatar Jan 24 '25 14:01 lindeskar
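
For anyone wanting to test the bloom-filter hypothesis above, a minimal sketch of the relevant per-tenant settings, using the names from that comment (placement under limits_config is an assumption; verify against the documentation for your Loki version before applying):

limits_config:
  # Disable query-time bloom filtering to see whether the "failed mapping AST" / 499 symptoms stop.
  bloom_gateway_enable_filtering: false
  # Bloom building alone was reported above as unproblematic, so it can stay enabled.
  bloom_creation_enabled: true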

One more with Loki 3.1.1 on the "read" pods, lots of:

ts=... caller=spanlogger.go:109 middleware=QueryShard.astMapperware org_id=... traceID=622160cf96ed4ead user=... level=warn msg="failed mapping AST" err="context canceled" query="..."

This happened after taking the one "backend" pod offline for maintenance for a few minutes. Restarting the "read" deployment worked; no bloom feature enabled here.

r-asiebert avatar Jan 28 '25 19:01 r-asiebert

We're facing the same issue. Any pointers here? Restarting the read deployment fixes it.

tuananh avatar Mar 05 '25 03:03 tuananh

Same issue with the Loki Helm chart version 6.27.0 (application version 3.4.2).

patsevanton avatar Mar 05 '25 11:03 patsevanton

We're also experiencing this issue with Loki 3.4.2, using an SSD deployment in Kubernetes via Helm chart 6.28.0. Bloom filtering is not enabled, at least it does not seem to be according to the runtime config.

Restarting seems to help

alexgenon avatar Mar 27 '25 14:03 alexgenon

We are also experiencing this issue via Helm chart 6.28.0, using SSD. Restarting the loki-read pods resolves the issue. The problem occurs at least once a month.

scottsananderson avatar Mar 31 '25 14:03 scottsananderson

Is a livenessProbe able to solve this problem too?

towolf avatar Mar 31 '25 15:03 towolf