vitess icon indicating copy to clipboard operation
vitess copied to clipboard

Bug Report: 'show vitess_replication_status' always says ReplicationLag is 0

Open wiebeytec opened this issue 1 year ago • 2 comments

Overview of the Issue

I had restarted my replicas, and because the (deprecated) option skip_slave_start is set in /vt/vt_0000000300/my.cnf, it didn't start. I then started it with start replica on the mysql prompt. Then, with show replica status, I see various things, but most importantly these keep changing:

Relay_Log_Pos: 149538815
Exec_Source_Log_Pos: 170394880
Seconds_Behind_Source: 66829

The Seconds_Behind_Source is important.

But, in show vitess_replication_status it's always 0:

mysql> show vitess_replication_status;
+-----------+-------+------------+--------------------------------------+--------------+---------------------+-------------------------------------------------------------------------+----------------+-----------------+
| Keyspace  | Shard | TabletType | Alias                                | Hostname     | ReplicationSource   | ReplicationHealth                                                       | ReplicationLag | ThrottlerStatus |
+-----------+-------+------------+--------------------------------------+--------------+---------------------+-------------------------------------------------------------------------+----------------+-----------------+
| sites2023 | -80   | REPLICA    | vic-eu-central-1-fakelive-0000000400 | 172.31.4.238 | 172.31.27.190:17300 | {"EventStreamRunning":"Yes","EventApplierRunning":"Yes","LastError":""} | 0              |                 |
| sites2023 | 80-   | REPLICA    | vic-eu-central-1-fakelive-0000000401 | 172.31.3.243 | 172.31.17.118:17301 | {"EventStreamRunning":"No","EventApplierRunning":"No","LastError":""}   | 0              |                 |
+-----------+-------+------------+--------------------------------------+--------------+---------------------+-------------------------------------------------------------------------+----------------+-----------------+

Reproduction Steps

  • Have a replica in a keyspace.
  • Make sure your primary has a lot of write traffic
  • shut down the replica
  • wait
  • start the replica again (START REPLICA or a Vitess command?)
  • Observe the ReplicationLag in show vitess_replication_status

Binary Version

vttablet version Version: 18.0.2 (Git revision d3012c188ea0cfc6837917fc6642ea23be9bb1ff branch 'HEAD') built on Wed Dec 20 14:27:31 UTC 2023 by runner@fv-az975-901 using go1.21.5 linux/amd64vttablet version Version: 18.0.2 (Git revision d3012c188ea0cfc6837917fc6642ea23be9bb1ff branch 'HEAD') built on Wed Dec 20 14:27:31 UTC 2023 by runner@fv-az975-901 using go1.21.5 linux/amd64

Operating System and Environment details

# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"
aptitude show mysql-server-core-8.0
Package: mysql-server-core-8.0    
Version: 8.0.36-0ubuntu0.22.04.1


### Log Fragments

```sh
mysql> show vitess_tablets;
+---------------------------+-----------+-------+------------+---------+--------------------------------------+---------------+----------------------+
| Cell                      | Keyspace  | Shard | TabletType | State   | Alias                                | Hostname      | PrimaryTermStartTime |
+---------------------------+-----------+-------+------------+---------+--------------------------------------+---------------+----------------------+
| vic-eu-central-1-fakelive | meta2023  | 0     | PRIMARY    | SERVING | vic-eu-central-1-fakelive-0000000700 | 172.31.25.159 | 2024-01-02T15:57:00Z |
| vic-eu-central-1-fakelive | sites2023 | -80   | PRIMARY    | SERVING | vic-eu-central-1-fakelive-0000000300 | 172.31.27.190 | 2024-02-16T09:31:01Z |
| vic-eu-central-1-fakelive | sites2023 | -80   | REPLICA    | SERVING | vic-eu-central-1-fakelive-0000000400 | 172.31.4.238  |                      |
| vic-eu-central-1-fakelive | sites2023 | 80-   | PRIMARY    | SERVING | vic-eu-central-1-fakelive-0000000301 | 172.31.17.118 | 2024-02-16T09:31:10Z |
| vic-eu-central-1-fakelive | sites2023 | 80-   | REPLICA    | SERVING | vic-eu-central-1-fakelive-0000000401 | 172.31.3.243  |                      |
| vic-eu-central-1-fakelive | legacy    | 0     | PRIMARY    | SERVING | vic-eu-central-1-fakelive-0000000600 | 172.31.25.59  | 2024-01-05T10:26:52Z |
+---------------------------+-----------+-------+------------+---------+--------------------------------------+---------------+----------------------+
6 rows in set (0,00 sec)

wiebeytec avatar Feb 23 '24 11:02 wiebeytec

Hi @wiebeytec,

Can you please provide a test case demonstrating this with the local examples? It should be relatively easy to do given the same basic process you showed here. It's hard to follow what you're seeing as you are mixing various things together here as your first output is for vt_0000000300 and the next for vic-eu-central-1-fakelive-0000000401.

I have been able to repeat one thing that you showed:

{"EventStreamRunning":"No","EventApplierRunning":"No","LastError":""}   | 0

That's because when replication is not running the lag is unknown. MySQL represents that with NULL for Seconds_Behind_Source. As an integer, that becomes the zeroval. I will change that handling to use a string instead so that we also display NULL in this case. The test case for that is:

git checkout main && make build

pushd examples/local 

./101_initial_cluster.sh

../common/scripts/vtorc-down.sh

command mysql -u root --socket=${VTDATAROOT}/vt_0000000102/mysql.sock --binary-as-hex=false -e "stop slave"

mysql < ../common/insert_commerce_data.sql

sleep 5

mysql -e "show vitess_replication_status"

The results::

+----------+-------+------------+------------------+-----------+-------------------+-------------------------------------------------------------------------+----------------+-----------------------------------------+
| Keyspace | Shard | TabletType | Alias            | Hostname  | ReplicationSource | ReplicationHealth                                                       | ReplicationLag | ThrottlerStatus                         |
+----------+-------+------------+------------------+-----------+-------------------+-------------------------------------------------------------------------+----------------+-----------------------------------------+
| commerce | 0     | REPLICA    | zone1-0000000100 | localhost | localhost:17101   | {"EventStreamRunning":"Yes","EventApplierRunning":"Yes","LastError":""} | 0              | {"state":"OK","load":0.00,"message":""} |
| commerce | 0     | RDONLY     | zone1-0000000102 | localhost | localhost:17101   | {"EventStreamRunning":"No","EventApplierRunning":"No","LastError":""}   | 0              | {"state":"OK","load":0.00,"message":""} |
+----------+-------+------------+------------------+-----------+-------------------+-------------------------------------------------------------------------+----------------+-----------------------------------------+

I'll try some more to repeat the behavior you're seeing but a test case would be very helpful. If I cannot repeat it then I will only fix what I was able to repeat.

Thank you!

mattlord avatar Feb 24 '24 17:02 mattlord

Hi @wiebeytec ,

When implementing this show vitess_replication_status command I used the tablet's replication lag stats variable (managed within the discover package's tablet healthcheck) instead of the seconds_behind_[master|source] value from mysqld precisely because the stats variable will estimate the lag when replication is not running (this value is used to determine the lag and whether or not the tablet should be serving or not). I've confirmed that this stats variable is no longer working as expected, which we can also see in the tablet's HTTP status page output where it's also used:

❯ curl -s http://localhost:15102/debug/status | grep -A1 "Replication Lag"
    <td>Replication Lag</td>
    <td>0s</td>

I will investigate that underlying issue further and open a PR to fix it.

Thanks again!

mattlord avatar Feb 24 '24 17:02 mattlord

My reference to vt_0000000300 was merely to explicitly use a config file name where the skip_slave_start was located , but not specific to one managed mysql. It was to prevent saying something vague like 'the mysql config'. That could have been more clear; true.

Anyway, going by your last comment, all confusion seems cleared up :+1:

wiebeytec avatar Feb 26 '24 08:02 wiebeytec