orchestrator QA: Can we use Seconds_Behind_Master(MariaDB 10.5.x) for ReplicationLagQuery?

Hi, @shlomi-noach.

As described in https://github.com/openark/orchestrator/blob/master/docs/configuration-recovery.md#promotion-actions

FailMasterPromotionOnLagMinutes: defaults 0 (not failing promotion). Can be used to fail a promotion if the candidate replica is too far behind. Example: replicas were broken for 5 hours, and then master failed. One might want to prevent the failover in order to recover the binary logs / relay logs for those lost 5 hours. To use this flag, you must set ReplicationLagQuery and use a heartbeat mechanism such as pt-heartbeat. The MySQL built-in Seconds_behind_master output of SHOW SLAVE STATUS (pre 8.0) does not report replication lag when replication is broken.

Can we use Seconds_Behind_Master for ReplicationLagQuery to measure replication lag? I currently use MariaDB 10.5.11 (release date: 23 Jun 2021).

see: https://mariadb.com/kb/en/show-replica-status/#column-descriptions

Jul 19 '21 06:07 leiless

> I wonder, can we use Seconds_Behind_Master for ReplicationLagQuery to determine replication lagging?

If you don't specify ReplicationLagQuery, then orchestrator uses Seconds_Behind_Master by default.

Jul 19 '21 06:07 shlomi-noach

Thanks for your reply!

One more thing: is Seconds_Behind_Master a reliable measure of replication lag? I have seen blog posts saying it is not, and that pt-heartbeat is the better choice. Sorry, I'm a newbie.

Jul 19 '21 09:07 leiless

It is not reliable in my experience. See http://code.openark.org/blog/mysql/seconds_behind_master-vs-absolute-slave-lag

Jul 19 '21 10:07 shlomi-noach

Thanks! https://code.openark.org/blog/mysql/seconds_behind_master-vs-absolute-slave-lag

Jul 19 '21 10:07 leiless

> I wonder, can we use Seconds_Behind_Master for ReplicationLagQuery to determine replication lagging?

> If you don't specify ReplicationLagQuery, then orchestrator uses Seconds_Behind_Master by default.

Aug 17 11:57:22 a.test.com orchestrator[3040]: 2021-08-17 11:57:22 FATAL nonzero FailMasterPromotionOnLagMinutes requires ReplicationLagQuery to be set

https://github.com/openark/orchestrator/blob/master/go/config/config.go#L544 https://github.com/openark/orchestrator/blob/master/docs/using-the-web-api.md

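That does not match what I see. I'm using orchestrator 3.2.6 and MariaDB 10.5.12, and with a nonzero FailMasterPromotionOnLagMinutes and no ReplicationLagQuery, orchestrator logs the FATAL error above.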
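The FATAL message suggests a configuration check along the following lines. This is only a sketch reconstructed from the log text, not a copy of config.go; the `lagConfig` type and its `validate` method are hypothetical stand-ins, and the field type is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// lagConfig is a hypothetical, trimmed-down stand-in for orchestrator's
// configuration struct; only the two fields under discussion are kept.
type lagConfig struct {
	FailMasterPromotionOnLagMinutes uint
	ReplicationLagQuery             string
}

// validate rejects a nonzero FailMasterPromotionOnLagMinutes when no
// ReplicationLagQuery is configured, mirroring the logged message.
func (c lagConfig) validate() error {
	if c.FailMasterPromotionOnLagMinutes > 0 && c.ReplicationLagQuery == "" {
		return errors.New("nonzero FailMasterPromotionOnLagMinutes requires ReplicationLagQuery to be set")
	}
	return nil
}

func main() {
	missing := lagConfig{FailMasterPromotionOnLagMinutes: 5}
	fmt.Println(missing.validate())

	// "select 1" is a placeholder query used only for this illustration.
	complete := lagConfig{FailMasterPromotionOnLagMinutes: 5, ReplicationLagQuery: "select 1"}
	fmt.Println(complete.validate())
}
```

Under this reading, the Seconds_Behind_Master default applies to lag reporting in general, but not to the FailMasterPromotionOnLagMinutes check.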

Aug 17 '21 05:08 leiless

It seems that, as of MariaDB 10.5.12, Seconds_Behind_Master cannot be selected on its own from the output of SHOW SLAVE STATUS: ~~SELECT t.Seconds_Behind_Master FROM (SHOW SLAVE STATUS) AS t;~~ won't work. See [MDEV-11123] Seconds_Behind_Master is not accessible through information_schema.

Aug 17 '21 06:08 leiless

According to https://github.com/openark/orchestrator/blob/master/go/inst/instance_dao.go#L640-L654

if config.Config.ReplicationLagQuery != "" && !isMaxScale {
	waitGroup.Add(1)
	go func() {
		defer waitGroup.Done()
		if err := db.QueryRow(config.Config.ReplicationLagQuery).Scan(&instance.ReplicationLagSeconds); err == nil {
			if instance.ReplicationLagSeconds.Valid && instance.ReplicationLagSeconds.Int64 < 0 {
				log.Warningf("Host: %+v, instance.SlaveLagSeconds < 0 [%+v], correcting to 0", instanceKey, instance.ReplicationLagSeconds.Int64)
				instance.ReplicationLagSeconds.Int64 = 0
			}
		} else {
			instance.ReplicationLagSeconds = instance.SecondsBehindMaster
			logReadTopologyInstanceError(instanceKey, "ReplicationLagQuery", err)
		}
	}()
}

If we deliberately make db.QueryRow(config.Config.ReplicationLagQuery).Scan(&instance.ReplicationLagSeconds) fail, ReplicationLagSeconds falls back to SecondsBehindMaster, which is the desired behavior.

However, this workaround has one side effect: it writes a harmless error to the log.

"ReplicationLagQuery": "SELECT 'see: https://github.com/openark/orchestrator/issues/1388#issuecomment-900014232'",

2021-08-18 11:58:31 ERROR ReadTopologyInstance(mariadb-13307:3306) ReplicationLagQuery: sql: Scan error on column index 0, name "see: https://github.com/openark/orchestrator/issues/1388#issuecomment-900014232": converting driver.Value type []uint8 ("see: https://github.com/openark/orchestrator/issues/1388#issuecomment-900014232") to a int64: invalid syntax

Also, this relies on the current implementation, which may change.

Aug 17 '21 07:08 leiless

@shlomi-noach, could we add a way for ReplicationLagQuery to indicate that Seconds_Behind_Master should be used directly, instead of the hack above?

For example: "ReplicationLagQuery": "-- Seconds_Behind_Master", to use Seconds_Behind_Master from SHOW SLAVE STATUS directly.
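As a rough sketch of what I mean, the check could look like the following. Everything here is hypothetical: the sentinel value is only the one proposed above, and `wantsSecondsBehindMaster` is not an existing orchestrator function.

```go
package main

import (
	"fmt"
	"strings"
)

// secondsBehindMasterSentinel is the marker value proposed in this thread.
// It is a suggestion only, not an existing orchestrator feature.
const secondsBehindMasterSentinel = "-- Seconds_Behind_Master"

// wantsSecondsBehindMaster reports whether the configured query is the
// sentinel, in which case no SQL would be run and the value from
// SHOW SLAVE STATUS would be used directly.
func wantsSecondsBehindMaster(replicationLagQuery string) bool {
	return strings.TrimSpace(replicationLagQuery) == secondsBehindMasterSentinel
}

func main() {
	fmt.Println(wantsSecondsBehindMaster("-- Seconds_Behind_Master"))
	fmt.Println(wantsSecondsBehindMaster("select 1"))
}
```

Because the sentinel is a non-empty string, it would also satisfy the "requires ReplicationLagQuery to be set" check without the error log produced by the workaround.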

Aug 17 '21 07:08 leiless