yugabyte-db
[Upgrade] QLRU 12 DBs Upgrade from 2.18.4 to 2024.1.0.0-b123 causes master leader to go unreachable and throughput never comes back
Jira Link: DB-11181
Description
The detailed conversation is in the Slack thread, which is attached to the JIRA ticket.
I upgraded two universes from version 2.18.4 to 2024.1. Both times, the master leader node became unreachable. Even after I brought it back online by stopping and starting it from the AWS console, the dropped connections never returned. Interestingly, this didn’t happen in our earlier experiments with 18 DBs, whereas we hit it here with 12 DBs, where the workload is lighter.
The difference is that the 18-DB test was upgraded to 2024.1.0.0-b105, while the 12-DB test was upgraded to 2024.1.0.0-b123. In the earlier build, the better defaults from Mark were in effect; as of 2024.1.0.0-b116 they are behind a gflag and off by default (https://phorge.dev.yugabyte.com/D34565). It looks like the better defaults were masking this issue.
For the upgrade from 2.18.4 to 2.20.3 with 12 DBs, I didn’t observe the above issue; none of the nodes became unavailable. As in our previous experiments, I didn’t observe a throughput drop in this experiment either.
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
- [X] I confirm this issue does not contain any sensitive information.