valkey icon indicating copy to clipboard operation
valkey copied to clipboard

Feature Request: Add Error Status Field for Diskless Syncs

Open naglera opened this issue 1 year ago • 6 comments

The problem/use-case that the feature addresses

Currently, Valkey has a lastbgsave_status field that tracks the status of disk-based bgsave. However, there is no equivalent field or status indicator for diskless sync operations. This lack of visibility into diskless sync errors makes it difficult to monitor and troubleshoot issues related to these operations.

Description of the feature

Introduce a new field or status indicator, tentatively named lastbgsave_diskless_status, to track the status of diskless sync operations. This field should be updated with an appropriate error code or message whenever an error occurs during the diskless sync process.

Alternatives you've considered

  1. Logging errors: Instead of introducing a new field, errors during diskless sync operations could be logged. However, this approach would require parsing logs to identify and monitor errors, which can be less efficient than having a dedicated status field.

  2. Reusing lastbgsave_status: Another alternative would be to reuse the existing lastbgsave_status field for both disk-based and diskless sync operations. However, this could lead to confusion and make it harder to distinguish between different types of errors. It also may make tests which already uses the metric to do wrong assertions.

naglera avatar Aug 08 '24 09:08 naglera

I mentioned here https://github.com/valkey-io/valkey-doc/pull/158 wanting to add these fields, and now seems like a good time to do it.

@valkey-io/core-team please take a look at the doc PR link, and see if we want to add the diskless related fields.

enjoy-binbin avatar Aug 08 '24 10:08 enjoy-binbin

I just want to confirm with you:

  1. the Diskless Sync meaning when repl-diskless-sync is set to yes, the primary send rdb to replica status?
  2. with which condition, there is error status? Could you please list most situations?
  3. the name lastbgsave_diskless_status is not properly, suggest to repl-diskless-sync-status or something else because for diskless-sync, there is no save file on disk

hwware avatar Aug 12 '24 16:08 hwware

  1. Yes, you are correct. When repl-diskless-sync is set to yes, the primary sends the RDB file directly to the replica's socket, without saving it to disk on the primary side.
  2. There are several situations where an error status can occur during the diskless sync process:
    • Short write: If the child process responsible for sending the RDB data encounters a short write while writing to the pipe.
    • Out of Memory: If the child process runs out of memory while creating the RDB file or sending it to the replica..
    • Network issues: If there are network problems or the connection to the replica is lost during the diskless sync process (when using dual-channel replication).
    • Some issue at the replica side that prevents it from receiving or storing the RDB data.
  3. I agree. repl-diskless-sync-status is a better name for the status variable.

naglera avatar Aug 13 '24 14:08 naglera

    • Short write: If the child process responsible for sending the RDB data encounters a short write while writing to the pipe.
    • Out of Memory: If the child process runs out of memory while creating the RDB file or sending it to the replica..
    • Network issues: If there are network problems or the connection to the replica is lost during the diskless sync process (when using dual-channel replication).
    • Some issue at the replica side that prevents it from receiving or storing the RDB data.

For case 2 and 4, I think it makes sense to add repl-diskless-sync-status. But for case 1 and 3, I am not familiar with this kind of case. @enjoy-binbin How about you?

hwware avatar Aug 14 '24 19:08 hwware

I think both cases can happend, as long as the situations will cause diskless-sync fail, we should set the repl-diskless-sync-status.

enjoy-binbin avatar Aug 15 '24 04:08 enjoy-binbin

@valkey-io/core-team please take a look and check if this needs to be fit into 8.0

Here are the fields we have now and their definitions, we can see we are mixing disk-based RDB and diskless RDB in some fields, and rdb_last_bgsave_status does not include the diskless RDB

  • rdb_changes_since_last_save: Number of changes since the last RDB file save
  • rdb_bgsave_in_progress: Flag indicating a RDB save is on-going, including a diskless replication RDB save
  • rdb_last_save_time: Epoch-based timestamp of last successful RDB file save
  • rdb_last_bgsave_status: Status of the last RDB file save operation
  • rdb_last_bgsave_time_sec: Duration of the last RDB save operation in seconds, including a diskless replication RDB save
  • rdb_current_bgsave_time_sec: Duration of the on-going RDB save operation if any, including a diskless replication RDB save

enjoy-binbin avatar Aug 26 '24 15:08 enjoy-binbin