Client disappearing

Open Rudd-O opened this issue 6 years ago • 13 comments

GC of client that was still running took place out of nowhere:

level=error ts=2017-10-03T09:21:09.446332474Z caller=proxy.go:97 msg="Responded to /clients" client_count=1
level=info ts=2017-10-03T09:21:28.653546534Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:22:28.653595779Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:23:28.653617421Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:24:28.653431992Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:25:28.653501368Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:26:28.653597843Z caller=coordinator.go:179 msg="GC of clients completed" deleted=1 remaining=0
level=info ts=2017-10-03T09:27:28.653382959Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=0

Client is still running.

Restarting the proxy re-registers the client, because the client retries.

Rudd-O avatar Oct 03 '17 11:10 Rudd-O

Disappears after 5 minutes:

level=info ts=2017-10-03T11:00:17.657654193Z caller=proxy.go:104 msg=Listening address=:8080
level=info ts=2017-10-03T11:00:18.079916119Z caller=coordinator.go:110 msg=WaitForScrapeInstruction fqdn=
level=error ts=2017-10-03T11:00:22.815177936Z caller=proxy.go:97 msg="Responded to /clients" client_count=1
level=info ts=2017-10-03T11:01:17.657898105Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:02:17.657956693Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:03:17.657939787Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:04:17.657922462Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:05:17.657942618Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:06:17.657953134Z caller=coordinator.go:179 msg="GC of clients completed" deleted=1 remaining=0

Rudd-O avatar Oct 03 '17 12:10 Rudd-O

Any endpoint that Prometheus has not scraped gets garbage-collected after five minutes. This means that a Prometheus outage of more than five minutes makes the proxy think the app has disappeared altogether.
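The expiry described above can be sketched as follows. This is only an illustration, not the proxy's actual code: the `registry` type, its methods, and the `simulate` helper are hypothetical names, and the five-minute limit is taken from the observation in this comment. One client registers one second after the proxy starts, is never refreshed, and a GC pass runs once a minute.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// registry is a hypothetical stand-in for the proxy's table of known clients.
type registry struct {
	mu       sync.Mutex
	maxAge   time.Duration
	lastSeen map[string]time.Time // keyed by client FQDN
}

func newRegistry(maxAge time.Duration) *registry {
	return &registry{maxAge: maxAge, lastSeen: make(map[string]time.Time)}
}

// touch records that the client polled the proxy at time now.
func (r *registry) touch(fqdn string, now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastSeen[fqdn] = now
}

// gc drops every client whose last poll is older than maxAge and
// returns the same two counters the proxy logs.
func (r *registry) gc(now time.Time) (deleted, remaining int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for fqdn, seen := range r.lastSeen {
		if now.Sub(seen) > r.maxAge {
			delete(r.lastSeen, fqdn)
			deleted++
		}
	}
	return deleted, len(r.lastSeen)
}

// simulate registers one client, never refreshes it, and runs one GC
// pass per minute, returning one summary line per pass.
func simulate(passes int) []string {
	r := newRegistry(5 * time.Minute)
	start := time.Date(2017, 10, 3, 11, 0, 17, 0, time.UTC)
	r.touch("client.example.com", start.Add(time.Second))
	var lines []string
	for i := 1; i <= passes; i++ {
		deleted, remaining := r.gc(start.Add(time.Duration(i) * time.Minute))
		lines = append(lines, fmt.Sprintf("deleted=%d remaining=%d", deleted, remaining))
	}
	return lines
}

func main() {
	for _, line := range simulate(6) {
		fmt.Println(line)
	}
}
```

Under these assumptions the first five passes keep the client and the sixth removes it, which is the same pattern as the log above.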

Rudd-O avatar Oct 03 '17 12:10 Rudd-O

Can the GC thread clean up and close the connection so the client can reconnect if it's still alive? The client isn't getting any signal that the connection has been closed, and thus never attempts to reconnect.

Rudd-O avatar Oct 03 '17 12:10 Rudd-O

Thanks for highlighting and debugging this issue!

I'm working on a fix for this now to join the paths correctly and avoid this bug.

conr avatar Oct 03 '17 13:10 conr

Awesome, but this is the bug about the disappearing client. The improperly joined URL is #9.

Rudd-O avatar Oct 03 '17 13:10 Rudd-O

The stderr log shows the client disappearing (deleted=1 and remaining=0), but somehow when I tried to scrape again, bam, the scrape worked. I am closing this, but note that the message is misleading.

Rudd-O avatar Oct 03 '17 13:10 Rudd-O

Sorry about that. I meant to comment on issue #9!

I'll take a look into this as well. Thanks for reporting!

conr avatar Oct 03 '17 13:10 conr

I think this issue should be reopened because the current behaviour is not consistent. When Prometheus does not scrape a client for a few minutes, the client disappears from the /clients list until the client restarts or Prometheus scrapes it again. Since /clients is often used to generate the scrape configuration for Prometheus, the disappeared clients are also dropped from the configuration. So the only way to restore the system to a functional state is to restart the client.
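To illustrate why a dropped client also vanishes from a generated configuration, here is a minimal sketch. It assumes that /clients returns a JSON list of target groups in the `file_sd_configs` shape; the `targetGroup` type, the `scrapeTargets` function, and the sample payloads are hypothetical and only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetGroup mirrors the file_sd_configs JSON shape. That /clients
// returns this shape is an assumption of this sketch.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels,omitempty"`
}

// scrapeTargets extracts every target from a /clients response body.
func scrapeTargets(body []byte) ([]string, error) {
	var groups []targetGroup
	if err := json.Unmarshal(body, &groups); err != nil {
		return nil, fmt.Errorf("decoding /clients response: %w", err)
	}
	targets := []string{}
	for _, g := range groups {
		targets = append(targets, g.Targets...)
	}
	return targets, nil
}

func main() {
	// Hypothetical responses before and after the GC removes the client.
	before := []byte(`[{"targets":["client.example.com"],"labels":{}}]`)
	after := []byte(`[]`)
	for _, body := range [][]byte{before, after} {
		targets, err := scrapeTargets(body)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%d target(s): %v\n", len(targets), targets)
	}
}
```

A generator built like this has no way to tell a client that was garbage-collected from one that is really gone, so the target is dropped from the configuration in both cases.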

In my understanding a client should only be dropped from the /clients list when the client is unreachable or not running anymore.

toerb avatar Apr 03 '18 06:04 toerb

> In my understanding a client should only be dropped from the /clients list when the client is unreachable or not running anymore.

Exactly, now it's kinda useless compared to what Brian wrote about getting it off wget via cron. I have some blackbox exporters and they disappear all the time.

fajfer avatar Sep 13 '18 12:09 fajfer

I'm seeing the same issue. It seems like the GC process should close the client connection if it removes it from the config.

claytono avatar Jan 13 '20 19:01 claytono

Do you want to send a PR?

brian-brazil avatar Jan 13 '20 19:01 brian-brazil

Sure, I can give it a shot. I was just getting up to speed on the code. If you've got any suggestions on what might be a good way to fit this in, I'd be happy to hear them. My thoughts so far were to try to cancel the context when removing a known client, or to try to renew the timestamp when keepalives are seen. I'm not yet sure how practical either approach is at this point.
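The two ideas in this comment could look roughly like the sketch below. It is not taken from the PushProx code base: the `coordinator` and `client` types, the method names, and the five-minute limit are all assumptions made for illustration. `gc` cancels the context of every client it evicts, so a blocked poll would return and the client could reconnect. `keepalive` renews the timestamp, so a live client is never evicted in the first place.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// client is a hypothetical record for one registered client.
type client struct {
	lastSeen time.Time
	cancel   context.CancelFunc
}

// coordinator is a hypothetical stand-in for the proxy's client table.
type coordinator struct {
	mu      sync.Mutex
	maxAge  time.Duration
	clients map[string]*client
}

// register stores the client and returns a context that ends when the
// client is evicted or replaced by a newer registration.
func (c *coordinator) register(parent context.Context, fqdn string, now time.Time) context.Context {
	ctx, cancel := context.WithCancel(parent)
	c.mu.Lock()
	defer c.mu.Unlock()
	if old, ok := c.clients[fqdn]; ok {
		old.cancel() // release the poll held by the previous registration
	}
	c.clients[fqdn] = &client{lastSeen: now, cancel: cancel}
	return ctx
}

// keepalive renews the timestamp of a known client (second idea).
func (c *coordinator) keepalive(fqdn string, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cl, ok := c.clients[fqdn]; ok {
		cl.lastSeen = now
	}
}

// gc evicts stale clients and cancels their contexts (first idea).
func (c *coordinator) gc(now time.Time) (deleted int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for fqdn, cl := range c.clients {
		if now.Sub(cl.lastSeen) > c.maxAge {
			cl.cancel()
			delete(c.clients, fqdn)
			deleted++
		}
	}
	return deleted
}

// demo registers an idle client and one that sends a keepalive, runs a
// GC pass after six minutes, and reports what happened to each.
func demo() (idleCancelled, refreshedKept bool) {
	c := &coordinator{maxAge: 5 * time.Minute, clients: make(map[string]*client)}
	t0 := time.Date(2020, 1, 13, 19, 0, 0, 0, time.UTC)
	idle := c.register(context.Background(), "idle.example.com", t0)
	busy := c.register(context.Background(), "busy.example.com", t0)
	c.keepalive("busy.example.com", t0.Add(4*time.Minute))
	c.gc(t0.Add(6 * time.Minute))
	return idle.Err() != nil, busy.Err() == nil
}

func main() {
	idleCancelled, refreshedKept := demo()
	fmt.Println("idle client cancelled:", idleCancelled)
	fmt.Println("refreshed client kept:", refreshedKept)
}
```

In a real handler the poll would presumably select on the returned context's `Done()` channel and close the connection when it fires. Whether the proxy can observe keepalives at all is exactly the open question raised above.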

claytono avatar Jan 13 '20 21:01 claytono

Probably better to keep it around if it's still working.

brian-brazil avatar Jan 13 '20 22:01 brian-brazil