Feature Request: Add Error Status Field for Diskless Syncs
The problem/use-case that the feature addresses
Currently, Valkey has a lastbgsave_status field that tracks the status of disk-based bgsave. However, there is no equivalent field or status indicator for diskless sync operations. This lack of visibility into diskless sync errors makes it difficult to monitor and troubleshoot issues related to these operations.
Description of the feature
Introduce a new field or status indicator, tentatively named lastbgsave_diskless_status, to track the status of diskless sync operations. This field should be updated with an appropriate error code or message whenever an error occurs during the diskless sync process.
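A minimal sketch of how such a field might be tracked alongside the existing one. All type, struct, and function names here are hypothetical and chosen for illustration; they are not taken from the Valkey source. The idea is only that a failed diskless child updates its own status and leaves the disk-based status untouched.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical child types; illustrative names, not Valkey's. */
typedef enum { CHILD_RDB_DISK, CHILD_RDB_SOCKET } child_type;

typedef enum { STATUS_OK, STATUS_ERR } save_status;

typedef struct {
    save_status lastbgsave_status;          /* existing: disk-based bgsave */
    save_status lastbgsave_diskless_status; /* proposed: diskless sync */
} persistence_stats;

/* Record the outcome of a finished child in the field matching its type. */
void record_child_exit(persistence_stats *st, child_type type, int exit_ok) {
    assert(st != NULL);
    save_status s = exit_ok ? STATUS_OK : STATUS_ERR;
    if (type == CHILD_RDB_DISK) {
        st->lastbgsave_status = s;
    } else {
        st->lastbgsave_diskless_status = s;
    }
}

const char *status_name(save_status s) {
    return s == STATUS_OK ? "ok" : "err";
}

/* Render both fields as INFO-style "name:value" lines. */
int format_status(const persistence_stats *st, char *buf, size_t len) {
    return snprintf(buf, len,
                    "rdb_last_bgsave_status:%s\r\n"
                    "lastbgsave_diskless_status:%s\r\n",
                    status_name(st->lastbgsave_status),
                    status_name(st->lastbgsave_diskless_status));
}
```

With this split, a monitoring tool could alert on the diskless field without being affected by disk-based save results, and vice versa.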
Alternatives you've considered
- Logging errors: Instead of introducing a new field, errors during diskless sync operations could be logged. However, this approach would require parsing logs to identify and monitor errors, which can be less efficient than having a dedicated status field.
- Reusing lastbgsave_status: Another alternative would be to reuse the existing lastbgsave_status field for both disk-based and diskless sync operations. However, this could lead to confusion and make it harder to distinguish between different types of errors. It could also cause existing tests that rely on this metric to make incorrect assertions.
I mentioned here https://github.com/valkey-io/valkey-doc/pull/158 wanting to add these fields, and now seems like a good time to do it.
@valkey-io/core-team please take a look at the doc PR link, and see if we want to add the diskless related fields.
I just want to confirm with you:
- Does "diskless sync" here mean the status of the primary sending the RDB to the replica when repl-diskless-sync is set to yes?
- Under which conditions would there be an error status? Could you please list the most common situations?
- The name lastbgsave_diskless_status is not appropriate. I suggest repl-diskless-sync-status or something similar, because with diskless sync no file is saved to disk.
- Yes, you are correct. When repl-diskless-sync is set to yes, the primary sends the RDB file directly to the replica's socket, without saving it to disk on the primary side.
- There are several situations where an error status can occur during the diskless sync process:
  - Short write: If the child process responsible for sending the RDB data encounters a short write while writing to the pipe.
  - Out of Memory: If the child process runs out of memory while creating the RDB file or sending it to the replica.
  - Network issues: If there are network problems or the connection to the replica is lost during the diskless sync process (when using dual-channel replication).
  - Some issue at the replica side that prevents it from receiving or storing the RDB data.
- I agree. repl-diskless-sync-status is a better name for the status variable.
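To make the failure cases above concrete, here is a rough sketch of how the result of a single write(2) call could be mapped to a failure reason. The enum, the function names, and the errno-to-reason mapping are assumptions made for illustration, not Valkey's actual error handling. Replica-side problems are not visible from a write call on the primary, so they fall under a generic "other" reason here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical failure reasons mirroring the cases listed above. */
typedef enum {
    SYNC_FAIL_NONE,
    SYNC_FAIL_SHORT_WRITE,
    SYNC_FAIL_OOM,
    SYNC_FAIL_NETWORK,
    SYNC_FAIL_OTHER
} sync_fail_reason;

/* Classify the result of one write: n is the return value, want is the
 * requested length, err is errno captured right after the call. */
sync_fail_reason classify_write_result(ssize_t n, size_t want, int err) {
    if (n < 0) {
        if (err == ENOMEM) return SYNC_FAIL_OOM;
        if (err == EPIPE || err == ECONNRESET) return SYNC_FAIL_NETWORK;
        return SYNC_FAIL_OTHER;
    }
    /* Fewer bytes written than requested counts as a short write. */
    if ((size_t)n < want) return SYNC_FAIL_SHORT_WRITE;
    return SYNC_FAIL_NONE;
}

/* Perform one write and classify its outcome. */
sync_fail_reason write_and_classify(int fd, const void *buf, size_t len) {
    assert(buf != NULL);
    ssize_t n = write(fd, buf, len);
    return classify_write_result(n, len, n < 0 ? errno : 0);
}
```

Whatever the reason, the point raised in this thread is the same: any outcome other than success would set the diskless status to an error value.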
For cases 2 and 4, I think it makes sense to add repl-diskless-sync-status. But I am not familiar with cases 1 and 3. @enjoy-binbin How about you?
I think both cases can happen. As long as a situation causes the diskless sync to fail, we should set repl-diskless-sync-status.
@valkey-io/core-team please take a look and check whether this needs to fit into 8.0.
Here are the fields we have now and their definitions. Some fields mix disk-based RDB and diskless RDB, and rdb_last_bgsave_status does not cover the diskless RDB:
- rdb_changes_since_last_save: Number of changes since the last RDB file save
- rdb_bgsave_in_progress: Flag indicating a RDB save is on-going, including a diskless replication RDB save
- rdb_last_save_time: Epoch-based timestamp of last successful RDB file save
- rdb_last_bgsave_status: Status of the last RDB file save operation
- rdb_last_bgsave_time_sec: Duration of the last RDB save operation in seconds, including a diskless replication RDB save
- rdb_current_bgsave_time_sec: Duration of the on-going RDB save operation if any, including a diskless replication RDB save
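For monitoring, these fields are read from the INFO output, which is a series of name:value lines. The helper below is a small sketch of extracting one field from such text; the function name info_field is hypothetical and the code is not part of any Valkey client library. A new diskless status field would be read the same way as rdb_last_bgsave_status.

```c
#include <assert.h>
#include <string.h>

/* Copy the value of the INFO field `name` into `out` (NUL-terminated,
 * truncated to outlen - 1 bytes). Returns 1 if found, 0 otherwise.
 * Only matches a field name at the start of a line. */
int info_field(const char *info, const char *name, char *out, size_t outlen) {
    assert(info != NULL && name != NULL && out != NULL && outlen > 0);
    size_t nlen = strlen(name);
    const char *p = info;
    while (*p) {
        if (strncmp(p, name, nlen) == 0 && p[nlen] == ':') {
            const char *v = p + nlen + 1;
            size_t vlen = strcspn(v, "\r\n"); /* value ends at line end */
            if (vlen >= outlen) vlen = outlen - 1;
            memcpy(out, v, vlen);
            out[vlen] = '\0';
            return 1;
        }
        const char *nl = strchr(p, '\n');
        if (nl == NULL) break;
        p = nl + 1; /* advance to the next line */
    }
    return 0;
}
```

Given sample text such as "rdb_bgsave_in_progress:0\r\nrdb_last_bgsave_status:ok\r\n", calling info_field with the name rdb_last_bgsave_status would copy "ok" into the output buffer.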