valkey Feature Request: Add Error Status Field for Diskless Syncs

The problem/use-case that the feature addresses

Currently, Valkey has a lastbgsave_status field that tracks the status of disk-based bgsave. However, there is no equivalent field or status indicator for diskless sync operations. This lack of visibility into diskless sync errors makes it difficult to monitor and troubleshoot issues related to these operations.

Description of the feature

Introduce a new field or status indicator, tentatively named lastbgsave_diskless_status, to track the status of diskless sync operations. This field should be updated with an appropriate error code or message whenever an error occurs during the diskless sync process.

Alternatives you've considered

Logging errors: Instead of introducing a new field, errors during diskless sync operations could be logged. However, this approach would require parsing logs to identify and monitor errors, which can be less efficient than having a dedicated status field.
Reusing lastbgsave_status: Another alternative would be to reuse the existing lastbgsave_status field for both disk-based and diskless sync operations. However, this could lead to confusion and make it harder to distinguish between different types of errors. It also may make tests which already uses the metric to do wrong assertions.

Aug 08 '24 09:08 naglera

I mentioned here https://github.com/valkey-io/valkey-doc/pull/158 wanting to add these fields, and now seems like a good time to do it.

@valkey-io/core-team please take a look at the doc PR link, and see if we want to add the diskless related fields.

Aug 08 '24 10:08 enjoy-binbin

I just want to confirm with you:

the Diskless Sync meaning when repl-diskless-sync is set to yes, the primary send rdb to replica status?
with which condition, there is error status? Could you please list most situations?
the name lastbgsave_diskless_status is not properly, suggest to repl-diskless-sync-status or something else because for diskless-sync, there is no save file on disk

Aug 12 '24 16:08 hwware

Yes, you are correct. When repl-diskless-sync is set to yes, the primary sends the RDB file directly to the replica's socket, without saving it to disk on the primary side.
There are several situations where an error status can occur during the diskless sync process:
- Short write: If the child process responsible for sending the RDB data encounters a short write while writing to the pipe.
- Out of Memory: If the child process runs out of memory while creating the RDB file or sending it to the replica..
- Network issues: If there are network problems or the connection to the replica is lost during the diskless sync process (when using dual-channel replication).
- Some issue at the replica side that prevents it from receiving or storing the RDB data.
I agree. repl-diskless-sync-status is a better name for the status variable.

Aug 13 '24 14:08 naglera

- Short write: If the child process responsible for sending the RDB data encounters a short write while writing to the pipe.
- Out of Memory: If the child process runs out of memory while creating the RDB file or sending it to the replica..
- Network issues: If there are network problems or the connection to the replica is lost during the diskless sync process (when using dual-channel replication).
- Some issue at the replica side that prevents it from receiving or storing the RDB data.

For case 2 and 4, I think it makes sense to add repl-diskless-sync-status. But for case 1 and 3, I am not familiar with this kind of case. @enjoy-binbin How about you?

Aug 14 '24 19:08 hwware

I think both cases can happend, as long as the situations will cause diskless-sync fail, we should set the repl-diskless-sync-status.

Aug 15 '24 04:08 enjoy-binbin

@valkey-io/core-team please take a look and check if this needs to be fit into 8.0

Here are the fields we have now and their definitions, we can see we are mixing disk-based RDB and diskless RDB in some fields, and rdb_last_bgsave_status does not include the diskless RDB

rdb_changes_since_last_save: Number of changes since the last RDB file save
rdb_bgsave_in_progress: Flag indicating a RDB save is on-going, including a diskless replication RDB save
rdb_last_save_time: Epoch-based timestamp of last successful RDB file save
rdb_last_bgsave_status: Status of the last RDB file save operation
rdb_last_bgsave_time_sec: Duration of the last RDB save operation in seconds, including a diskless replication RDB save
rdb_current_bgsave_time_sec: Duration of the on-going RDB save operation if any, including a diskless replication RDB save

Aug 26 '24 15:08 enjoy-binbin