
`patronictl list` shows wrong information when etcd cluster is unhealthy

Open daamien opened this issue 4 years ago • 3 comments

Hi !

Thanks for this great software !

Here's a problem I encountered recently with a minimal setup :

  • 3 etcd nodes
  • 2 PostgreSQL nodes

Now imagine a situation where 2 etcd nodes are down. Patroni will demote the primary PostgreSQL node. However, the command `patronictl list` still displays the previous state of the PostgreSQL nodes: the old primary is presented as the "Leader" even though Patroni has put it in standby mode.

Maybe I missed something in my configuration. Let me know if this is not the expected behaviour of patronictl.

Anyway, this may be confusing for end users. At least when the etcd cluster is unhealthy, patronictl should output a warning saying that "the DCS cluster has a problem" and "the information displayed may be incorrect".
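To make the request concrete, here is a minimal sketch of what such a warning could look like. It is not Patroni code: `etcd_reports_healthy` and `render_with_warning` are hypothetical helpers, and the sketch assumes the JSON body of etcd's `/health` endpoint has the shape `{"health": "true"}`.

```python
import json

# Wording taken from the request above; the exact message is illustrative.
WARNING = (
    "WARNING: the DCS cluster has a problem; "
    "the information displayed may be incorrect"
)


def etcd_reports_healthy(health_body: str) -> bool:
    """Interpret the JSON body returned by etcd's /health endpoint.

    Assumption: the body looks like {"health": "true"}. A boolean value
    is accepted as well. Anything that cannot be parsed is treated as
    unhealthy, so the warning errs on the side of being shown.
    """
    try:
        payload = json.loads(health_body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    return str(payload.get("health")).lower() == "true"


def render_with_warning(table: str, dcs_healthy: bool) -> str:
    """Prepend the warning to the member table when the DCS is unhealthy."""
    return table if dcs_healthy else f"{WARNING}\n{table}"
```

A caller would fetch `/health` from each configured etcd endpoint, pass the body to `etcd_reports_healthy`, and wrap the table output with `render_with_warning`. This only covers etcd; Consul and ZooKeeper would need their own checks.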

Thanks again for your work

daamien avatar Sep 30 '19 15:09 daamien

That happens because we are not doing quorum reads. IMHO figuring out the state of the etcd cluster is not really a task for patronictl. There are two possible ways to detect that something is wrong:

  1. Switch to quorum reads from patronictl, but in this case it will show nothing.
  2. Use the information about the leader key TTL. When a node is isolated (read-only), the TTL becomes negative.

But taking into account that we also have to support Consul and ZooKeeper, it looks like the only choice is option 1...
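The two options can be sketched against the etcd v2 keys API, which accepts a `quorum=true` query parameter on reads. This is only an illustration under stated assumptions, not Patroni's implementation: the helper names are hypothetical, and the key layout `/service/<scope>/leader` is assumed to match the default namespace.

```python
from urllib.parse import quote, urlencode


def leader_key_url(base_url: str, scope: str, quorum: bool,
                   namespace: str = "/service") -> str:
    """Build an etcd v2 URL for reading the leader key (option 1).

    Assumption: the leader key lives at <namespace>/<scope>/leader.
    With quorum=True the read must be confirmed by a majority of etcd
    members, so it fails instead of returning possibly stale data.
    """
    key = f"{namespace.rstrip('/')}/{scope}/leader"
    url = f"{base_url.rstrip('/')}/v2/keys{quote(key)}"
    if quorum:
        url += "?" + urlencode({"quorum": "true"})
    return url


def leader_looks_stale(response: dict) -> bool:
    """Apply the TTL heuristic to a parsed v2 response (option 2).

    A negative TTL on the leader key is taken as a sign that the
    answering node is isolated; a missing TTL is not flagged.
    """
    ttl = response.get("node", {}).get("ttl")
    return ttl is not None and ttl < 0


url = leader_key_url("http://127.0.0.1:2379", "patroni_cluster_1", quorum=True)
# url == "http://127.0.0.1:2379/v2/keys/service/patroni_cluster_1/leader?quorum=true"
```

The TTL check relies on etcd-specific behaviour, which is why it does not carry over to Consul or ZooKeeper, while a quorum (consistent) read is a concept all three stores share.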

CyberDem0n avatar Sep 30 '19 17:09 CyberDem0n

Hi @CyberDem0n

Thanks for the quick reply. I understand that patronictl has to support different distributed stores.

From my perspective, I think it is better to show nothing rather than displaying potentially incorrect information about the PostgreSQL nodes.

When the quorum is lost:

  • If `patronictl list` returns nothing, the admin will investigate and will probably find quickly that the etcd cluster is the source of the problem.

  • If `patronictl list` says that the primary node is still the Leader while it is in fact in standby mode, the admin may have trouble diagnosing the situation.

daamien avatar Sep 30 '19 21:09 daamien

I'm facing the same issue: my etcd cluster went down and the Patroni config is in shambles. Both slaves are unhealthy now, and I'm trying to make them work.

+-------------------+----------+--------------+--------+----------+----+-----------+
| Cluster           | Member   | Host         | Role   | State    | TL | Lag in MB |
+-------------------+----------+--------------+--------+----------+----+-----------+
| patroni_cluster_1 | member_1 | 10.1.100.216 | Leader | running  | 5  | 0         |
| patroni_cluster_1 | member_2 | 10.1.100.204 |        | starting |    | unknown   |
| patroni_cluster_1 | member_3 | 10.1.100.217 |        | starting |    | unknown   |
+-------------------+----------+--------------+--------+----------+----+-----------+

ghost avatar Mar 12 '20 14:03 ghost