cockroach
cockroach copied to clipboard
[DRAFT] jobs: state change structured logs
NOTE: Only consider the final commit
This is a WIP aimed at getting some early feedback regarding the fields we want to include in a structured log used to track job state changes.
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
@abarganier for disaster recovery can we also include the fields from SHOW BACKUP https://www.cockroachlabs.com/docs/v24.1/show-backup.html as well (Assuming that based on Kevin's previous comments the fields from SHOW JOBS will also be returned which would also include the start and end time of the job)
cc @benbardin
SHOW BACKUP
I'm not sure I understand what is being suggested here. This PR is adding a log of job state changes -- when it was paused, when it was resumed, when its status message was changed. This is generic to all jobs -- any job that is paused or resumed or changes its status message will cause such an event to be logged, and such events are applicable to all jobs.
SHOW BACKUP isn't a job; it shows the content of a specific backup chain by listing its manifests, as they currently exist in the CSP bucket. I'm not sure that really is related to this log of job events?
I think she's asking for fields that would show up in SHOW BACKUP to be present in the job that created that backup?
I don't think that is a jobs change -- that'd be a backup change.
This change is about adding logging of job changes to the jobs system.
I imagine if we have a bunch of backup specific events, we'd probably just want to use the backup events in their own log. Indeed, we already have structured logs specific to backups that record details of the backup here: https://www.cockroachlabs.com/docs/stable/eventlog#recovery_event
It might be a better UX to have a single Job event that captures job statuses, changes, and all its metadata. This single event would also have job type specific information in another field (e.g., specific running statuses, backup_type, changefeed sink_uri, etc.).
We'd avoid having to join across multiple logs in our internal tool (which doesn't have great join capabilities yet) to understand what job was running at a given time, its job type, and then jobs type-specific information for further troubleshooting.
Thoughts?
Also,
we already have structured logs specific to backups that record details of the backup here: https://www.cockroachlabs.com/docs/stable/eventlog#recovery_event
One thing to call out is that the recovery event was introduced for internal product analytics via the TELEMETRY channel. We haven't vetted or determined whether we want end users to consume this log - in case we were targeting an external customer here. FYI for Alicia when she's back.
Since I'm taking over ownership of this work, I put up a new PR, https://github.com/cockroachdb/cockroach/pull/125319.
@kevin-v-ngo @alicia-l2. Can we continue any conversations there?
@msbutler I've addressed some of the comments you left in my new PR, as well as removed some of the fields that existed in the JobStateChange
struct.
@abarganier I think you can close this PR now.
@kyle-a-wong I guess you meant @abarganier instead.
Thanks @kyle-a-wong - closing.