recovery from connection closed

Open phebous opened this issue 3 years ago • 4 comments

Hello,

I had to increase the disk size for the Timescale database. Because I use BOSH to manage the VM that runs Timescale, this is completely automated: I update the manifest with the new size, and BOSH shuts down the database, creates the new disk, copies the data to it, and restarts the server. The bug I found in Promscale is that when the database is shut down, the active connection to the database receives a "conn closed" error, and Promscale stays in this state even after the database comes back up. On the positive side, the Promscale instance that is not the leader picks up the Prometheus traffic and becomes the leader, so the only downtime is the time the database itself is down.

So my ask is for Promscale to handle closed connections better. Currently, Promscale gets stuck in the "conn closed" state, and the only way I know to get it out of this state is to bounce the affected Promscale instance. Can we make Promscale more robust so that it recovers gracefully from "conn closed"?

phebous avatar Dec 15 '20 20:12 phebous

Thanks for this report. Do you think it's better to exit the process or recover internally?

cevian avatar Dec 18 '20 21:12 cevian

Since this is a service and meant to be run unattended, I would think that it should attempt to recover.

phebous avatar Dec 18 '20 21:12 phebous

Cevian,

I like the latest version. It now recovers from database closures on the standby instance. However, the leader still has a problem and requires a bounce before it will take over a role.

phebous avatar Feb 03 '21 17:02 phebous

@phebous when you refer to the leader Promscale, are you using the advisory-lock-based Promscale HA that was offered to manage Prometheus HA data?

VineethReddy02 avatar May 30 '22 07:05 VineethReddy02

@phebous I'm closing this issue because the Promscale leader-based HA is deprecated. Today, all Promscale Connector instances connect to the DB as standby instances. This issue was created in 2020, and since then Promscale has improved in reliability and performance; we have not noticed this behavior in our internal tests.

Feel free to re-open the issue if you still see the problem. :)

VineethReddy02 avatar Aug 17 '22 03:08 VineethReddy02