patroni
`patronictl list` shows wrong information when etcd cluster is unhealthy
Hi!
Thanks for this great software!
Here's a problem I encountered recently with a minimal setup:
- 3 etcd nodes
- 2 PostgreSQL nodes
Now imagine a situation where 2 of the etcd nodes are down. Patroni will demote the primary PostgreSQL node. However, `patronictl list` still displays the previous state of the PostgreSQL nodes: the old primary is still presented as the "Leader", even though Patroni has put it in standby mode.
Maybe I missed something in my configuration. Let me know if this is not the expected behaviour of `patronictl`.
Anyway, this may be confusing for end users. At the very least, when the etcd cluster is unhealthy, `patronictl` should output a warning saying that the DCS cluster has a problem and that the information displayed may be incorrect.
Thanks again for your work
That happens because we are not doing quorum reads. IMHO, figuring out the state of the etcd cluster is not really a task for patronictl. There are two possible ways to get a hint that something is wrong:
- switch to quorum reads in patronictl, but in this case it will show nothing when quorum is lost
- use the information about the leader key TTL: when a node is isolated (read-only), the TTL becomes negative

But taking into account that we also have to support Consul and ZooKeeper, it looks like the only choice is option 1...
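The TTL idea above (option 2) could be sketched roughly as follows. This is only an illustration of the heuristic, not patronictl code: `leader_info_is_stale` and `render_warning` are hypothetical helper names, and the warning text mirrors the wording requested earlier in the thread.

```python
# Rough sketch of option 2: decide whether the cluster view may be stale
# by inspecting the leader key TTL read from the DCS. A healthy leader key
# carries a positive TTL; when the DCS node we read from is isolated
# (read-only), the computed remaining TTL can go negative.
# Both helpers below are hypothetical, not part of patronictl.

def leader_info_is_stale(leader_ttl):
    """Return True when the leader key TTL suggests the data is stale."""
    # No TTL at all (leader key missing or expired) or a negative TTL
    # both mean the information we just read should not be trusted.
    return leader_ttl is None or leader_ttl < 0

def render_warning(leader_ttl):
    """What a patronictl-style warning could look like, per the request above."""
    if leader_info_is_stale(leader_ttl):
        return ("WARNING: the DCS cluster has a problem; "
                "the information displayed may be incorrect")
    return ""

if __name__ == "__main__":
    print(render_warning(-5))   # isolated node: TTL went negative
    print(render_warning(30))   # healthy leader key: no warning
```

The catch, as noted above, is that this relies on etcd-specific TTL semantics, which is why it does not generalize cleanly to Consul and ZooKeeper.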
Hi @CyberDem0n
Thanks for the quick reply. I understand that patronictl has to support different distributed stores.
From my perspective, I think it is better to show nothing rather than displaying potentially incorrect information about the PostgreSQL nodes.
When the quorum is lost:
- If `patronictl list` returns nothing, the admin will investigate and probably quickly find that the quorum cluster is the source of the problem.
- If `patronictl list` says that the primary node is still the Leader while it is in fact in standby mode, the admin may have trouble diagnosing the situation.
Facing the same issue: my etcd cluster went down and my Patroni config is in shambles. Both replicas are unhealthy now, and I'm trying to get them working again.
```
+-------------------+----------+--------------+--------+----------+----+-----------+
| Cluster           | Member   | Host         | Role   | State    | TL | Lag in MB |
+-------------------+----------+--------------+--------+----------+----+-----------+
| patroni_cluster_1 | member_1 | 10.1.100.216 | Leader | running  |  5 |         0 |
| patroni_cluster_1 | member_2 | 10.1.100.204 |        | starting |    |   unknown |
| patroni_cluster_1 | member_3 | 10.1.100.217 |        | starting |    |   unknown |
+-------------------+----------+--------------+--------+----------+----+-----------+
```