ccl/backupccl: add log-based telemetry to backup and restore
Backport 1/1 commits from #82463.
/cc @cockroachdb/release
Previously, backup, backup schedule, and restore events were not logged in the telemetry structured logs. These logs are needed because they will be exported to Snowflake as part of the new telemetry system. This change logs an event for every invoked backup and restore, and whenever a backup schedule is created. These events have the following format:
| Field name | Field type | Example Value | Description | Field is valid for |
|---|---|---|---|---|
| recovery_type | predefined list | {backup, scheduled backup, restore} | Did the user use backup, restore or scheduled backup? | all events |
| target_scope | predefined list | {cluster, database, table, schema} | What is the scope of the target object the user is backing up / restoring? | all events |
| is_multiregion_target | bool | true, false | Does the target contain objects with multi-region primitives? | all events |
| target_count | int | 3 | How many targets (databases, clusters, etc) is the user backing up/restoring? | all events |
| destination_subdir_type | predefined list | {custom, standard, latest} | custom = custom name for their sub directory, standard = date-based sub-dir, latest = latest subdir | all events |
| destination_storage_types | predefined list | {aws, gs, azure, http, nodelocal, userfile, other} | What is the cloud storage that the user is writing this backup to / restoring from? | all events |
| destination_auth_types | predefined list | {implicit, specified, other} | What authentication is used to access the cloud storage that the user wants to write the backup to / restore from? | all events |
| is_locality_aware | bool | true, false | Is this backup / restore locality aware? | all events |
| as_of_interval | int | relative time passed in AOST flag (e.g. -10s) | What system time does the user want to run this backup / restore as of? | all events |
| with_revision_history | bool | true, false | Does the backup include revision history? | all events |
| has_encryption_passphrase | bool | true, false | Did the user provide an encryption passphrase to encrypt / decrypt their backup? | all events |
| is_detached | bool | true, false | Did the user take a backup / restore with detached flag? | all events |
| kms_type | predefined list | {aws, gcp, other, none} | Did the user provide a KMS to encrypt/decrypt backup? Which KMS? | all events |
| kms_count | int | 2 | Did the user provide multiple KMSs to encrypt / decrypt the backup? How many? | all events |
| result_status | predefined list | succeeded, failed, canceled | What was the result code of the backup - did it succeed, fail? | all events |
| error_text | string | custom | What was the reason for failure? | all events |
| recurring_cron | string | default, custom crontab string (e.g. 1d) | How often does the user want to take a backup? (full or inc) | scheduled backups |
| full_backup_cron | string | default, always, custom crontab string (e.g. 1w) | How often does the user want to take a full backup? | scheduled backups |
| custom_first_run_time | int | timestamp | Did the user configure a custom first run time? | scheduled backups |
| on_execution_failure | predefined list | {retry, reschedule, pause, other} | What does the user want to do if the schedule fails to execute? | scheduled backups |
| on_previous_running | predefined list | {start, skip, wait} | What does the user want to do if the previous scheduled backup is still running? | scheduled backups |
| ignore_existing_backup | bool | {true, false} | If backups were already created in the destination that the new schedule references, is the new schedule backing up different objects? | scheduled backups |
| restore_options | list of predefined strings | ["into_db", "skip_missing_fk"] | Which restore options did the user use? | restore |
| into_db | entry in restore_options | restore_options : ["into_db"] | Did the user provide a new DB to restore the table(s) to? | restore (table-level) |
| rename_db | entry in restore_options | restore_options : ["rename_db"] | Did the user provide a new name for the restored DB? | restore (database-level) |
| skip_missing_fk | entry in restore_options | restore_options : ["skip_missing_fk"] | Does the user want to skip missing foreign keys on restore? | restore |
| skip_missing_sequences | entry in restore_options | restore_options : ["skip_missing_sequences"] | Does the user want to skip missing sequences on restore? | restore |
| skip_missing_views | entry in restore_options | restore_options : ["skip_missing_views"] | Does the user want to skip missing views on restore? | restore |
| skip_localities_check | entry in restore_options | restore_options : ["skip_localities_check"] | Does the user want to skip the check for mismatched localities on restore? | restore |
| debug_pause_on | predefined list | {error} | Does the user want to pause the restore if an error occurs? | restore |
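
To make the shape of these events concrete, here is a minimal sketch of how a subset of the fields above could be serialized as a structured log payload. The `RecoveryEvent` struct and the `encodeEvent` helper are hypothetical names used only for illustration; they are not the types this PR adds, and representing `destination_storage_types` and `restore_options` as string slices is an assumption based on the table.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RecoveryEvent is a hypothetical struct mirroring a subset of the fields in
// the table above. It is NOT the event type added by this PR.
type RecoveryEvent struct {
	RecoveryType            string   `json:"recovery_type"`
	TargetScope             string   `json:"target_scope"`
	IsMultiregionTarget     bool     `json:"is_multiregion_target"`
	TargetCount             int      `json:"target_count"`
	DestinationStorageTypes []string `json:"destination_storage_types"`
	// Only meaningful for restore events; omitted when empty.
	RestoreOptions []string `json:"restore_options,omitempty"`
	ResultStatus   string   `json:"result_status"`
	// Only populated on failure; omitted when empty.
	ErrorText string `json:"error_text,omitempty"`
}

// encodeEvent (hypothetical helper) renders the event as a JSON string,
// with keys in struct declaration order.
func encodeEvent(ev RecoveryEvent) (string, error) {
	b, err := json.Marshal(ev)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// A table-level restore into a new database, as in the into_db row.
	ev := RecoveryEvent{
		RecoveryType:            "restore",
		TargetScope:             "table",
		TargetCount:             3,
		DestinationStorageTypes: []string{"nodelocal"},
		RestoreOptions:          []string{"into_db", "skip_missing_fk"},
		ResultStatus:            "succeeded",
	}
	s, err := encodeEvent(ev)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

In this sketch the restore-only and failure-only fields use `omitempty`, so a successful backup event would carry neither `restore_options` nor `error_text`.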
Release note (enterprise change): Backup, restore, and backup schedule creation now have corresponding events that are emitted to the telemetry channel.
@livlobo has confirmed this is a high priority addition to 22.1?

That's correct!