Panic when reloading with changed remote_write config

Open · tpaschalis opened this issue on Apr 19, 2022 · 2 comments

I was testing #1478 with a dummy remote_write endpoint, when the Agent suddenly panicked and shut down.

Incidentally, it happened around the time I was trying to reload a config file: the curl localhost:12345/-/reload call hung, and after a while the command returned and the Agent was down. Before and after that, all calls to /-/reload worked fine, so I'm not sure whether it was related or a one-off occurrence.

I'll investigate further, as I wasn't able to reproduce it under the same conditions, but I'm leaving the stack trace here for future reference and for others.

ts=2022-04-19T09:42:47.020614397Z caller=gokit.go:65 level=debug msg="GET /agent/api/v1/metrics/integrations/sd?instance=ebpf&integrations=ebpf (200) 181.043µs"
ts=2022-04-19T09:42:50.709134329Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=error remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Failed to flush metadata"
ts=2022-04-19T09:42:50.70948193Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=error remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="non-recoverable error while sending metadata" count=178 err="context canceled"
ts=2022-04-19T09:42:50.709647309Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=info remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Scraped metadata watcher stopped"
ts=2022-04-19T09:42:50.710576469Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=info remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Remote storage stopped."
ts=2022-04-19T09:42:50.711054063Z caller=manager.go:271 level=info agent=prometheus msg="stopped instance" instance=1441107aae5b2bac39bf41c80aa47c88
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x26d30ec]

goroutine 276 [running]:
github.com/grafana/agent/pkg/metrics/instance.(*Instance).Appender(0xc00024a100?, {0x52744b0?, 0xc000696800?})
	/root/agent/pkg/metrics/instance/instance.go:568 +0x2c
github.com/grafana/agent/pkg/integrations/v2/autoscrape.(*agentAppender).Appender(0xc00014bce0, {0x52744b0, 0xc000696800})
	/root/agent/pkg/integrations/v2/autoscrape/autoscrape.go:248 +0xa3
github.com/prometheus/prometheus/scrape.newScrapePool.func2.3({0x52744b0?, 0xc000696800?})
	/root/go/pkg/mod/github.com/grafana/[email protected]/scrape/scrape.go:311 +0x35
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0xc0007bf380, {0xed9f07abf?, 0x7746400?, 0x7746400?}, {0xed9f07abf?, 0x7746400?, 0x7746400?}, 0x0)
	/root/go/pkg/mod/github.com/grafana/[email protected]/scrape/scrape.go:1257 +0x3bc
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc0007bf380, 0xc000bda480?)
	/root/go/pkg/mod/github.com/grafana/[email protected]/scrape/scrape.go:1216 +0x345
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
	/root/go/pkg/mod/github.com/grafana/[email protected]/scrape/scrape.go:587 +0xa1e

Edit: as far as I could tell, this happened with the default agent-local-config.yaml file, while I was replacing the dummy remote_write endpoint with my Grafana Cloud stack's details.

tpaschalis · Apr 19 '22 10:04

Here's a short reproduction on main (currently on https://github.com/grafana/agent/commit/2f4b5b37ac470fa5b0ebecb039495d8d397b711e). I'm running go1.18 on darwin/arm64.

  1. Check out and build the Agent. Run it with integrations-next enabled and the following config (basically agent-local-config.yaml, but with the config name changed to 'default'; otherwise the Agent integration complains).
server:
  log_level: info

metrics:
  global:
    scrape_interval: 1m
  configs:
    - name: default
      host_filter: false
      scrape_configs:
        - job_name: local_scrape
          static_configs:
            - targets: ['127.0.0.1:12345']
              labels:
                cluster: 'localhost'
      remote_write:
        - url: http://localhost:9009/api/prom/push

integrations:
  agent:
    instance: "foo"

Wait for at least one unsuccessful push to happen. It will emit logs like

ts=2022-04-20T11:47:25.870629Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=warn remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Failed to send batch, retrying" err="Post \"http://localhost:9009/api/prom/push\": dial tcp [::1]:9009: connect: connection refused"
  2. Change the config to include a 'real' remote_write endpoint, like
server:
  log_level: info

metrics:
  global:
    scrape_interval: 1m
  configs:
    - name: default
      host_filter: false
      scrape_configs:
        - job_name: local_scrape
          static_configs:
            - targets: ['127.0.0.1:12345']
              labels:
                cluster: 'localhost'
      remote_write:
      - url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
        basic_auth:
          username: 297344
          password: eyJ<redacted>

integrations:
  agent:
    instance: "foo"

Save and reload the config by calling the /-/reload endpoint (e.g. with curl localhost:12345/-/reload, as above). The command hangs, and the Agent panics after about 60 seconds with the stack trace presented above.
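
For completeness, the reload can also be triggered from a tiny Go program equivalent to the curl call above; the 90-second client timeout is arbitrary, just so the hanging request eventually returns instead of blocking forever.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Equivalent of `curl localhost:12345/-/reload`; 12345 is the Agent's
	// HTTP listen port used throughout this reproduction.
	client := &http.Client{Timeout: 90 * time.Second}

	resp, err := client.Get("http://localhost:12345/-/reload")
	if err != nil {
		// With the bug present, the call hangs here until the Agent panics
		// and drops the connection (or the client timeout fires).
		fmt.Println("reload failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("reload returned:", resp.Status)
}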

I tried the same with node_exporter but couldn't reproduce the issue, so it may have to do with the way we set up the v2 Agent integration.

Edit: The nil pointer is on the wal field of the metrics instance. It might be that there's not enough time to re-initialize the WAL after the reload before an autoscrape kicks in?

Edit 2: Disabling autoscrape, and thus giving the instance a moment to replay the WAL, makes it work.
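
To make the suspected sequence a bit more concrete, here's a minimal, self-contained sketch of the race as I currently understand it. The types, fields and method names below are invented for illustration and are not the Agent's actual code; the point is only that an Appender call racing with the WAL re-initialization dereferences a nil pointer unless it is guarded.

package main

import (
	"errors"
	"fmt"
	"sync"
)

type wal struct{}

type appender struct{}

func (w *wal) Appender() *appender { return &appender{} }

type instance struct {
	mut sync.Mutex
	wal *wal // nil from "stopped instance" until the WAL has been replayed
}

// Mirrors the shape of the panicking call in the stack trace above: the WAL
// is dereferenced without checking whether re-initialization has finished.
func (i *instance) Appender() *appender {
	return i.wal.Appender() // nil pointer dereference if called mid-reload
}

// One possible guard: refuse (or hand back a no-op appender) until the WAL
// exists again, so a racing autoscrape can't crash the whole process.
func (i *instance) appenderGuarded() (*appender, error) {
	i.mut.Lock()
	defer i.mut.Unlock()
	if i.wal == nil {
		return nil, errors.New("instance not ready: WAL not re-initialized yet")
	}
	return i.wal.Appender(), nil
}

func main() {
	inst := &instance{} // state right after the reload tears down the old WAL
	if _, err := inst.appenderGuarded(); err != nil {
		fmt.Println(err)
	}
	// inst.Appender() // would panic with a nil pointer, as in the report
}

If that's roughly what's happening, a guard like the one above (or having autoscrape wait until the instance is ready again) should avoid the crash, though the real fix probably belongs in how the reload re-creates the instance.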

tpaschalis · Apr 20 '22 11:04

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

github-actions[bot] · May 28 '22 00:05