cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

[DRAFT] jobs: state change structured logs

Open abarganier opened this issue 9 months ago • 2 comments

NOTE: Only consider the final commit

This is a WIP aimed at getting some early feedback regarding the fields we want to include in a structured log used to track job state changes.

abarganier avatar May 15 '24 19:05 abarganier

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar May 15 '24 19:05 blathers-crl[bot]

This change is Reviewable

cockroach-teamcity avatar May 15 '24 19:05 cockroach-teamcity

@abarganier for disaster recovery can we also include the fields from SHOW BACKUP https://www.cockroachlabs.com/docs/v24.1/show-backup.html as well (Assuming that based on Kevin's previous comments the fields from SHOW JOBS will also be returned which would also include the start and end time of the job)

cc @benbardin

alicia-l2 avatar Jun 04 '24 01:06 alicia-l2

SHOW BACKUP

I'm not sure I understand what is being suggested here. This PR is adding a log of job state changes -- when it was paused, when it was resumed, when its status message was changed. This is generic to all jobs -- any job that is paused or resumed or changes its status message will cause such an event to be logged, and such events are applicable to all jobs.

SHOW BACKUP isn't a job; it shows the content of a specific backup chain by listing its manifests, as they currently exist in the CSP bucket. I'm not sure that really is related to this log of job events?

dt avatar Jun 04 '24 19:06 dt

I think she's asking for fields that would show up in SHOW BACKUP to be present in the job that created that backup?

benbardin avatar Jun 04 '24 19:06 benbardin

I don't think that is a jobs change -- that'd be a backup change.

This change is about adding logging of job changes to the jobs system.

I imagine if we have a bunch of backup specific events, we'd probably just want to use the backup events in their own log. Indeed, we already have structured logs specific to backups that record details of the backup here: https://www.cockroachlabs.com/docs/stable/eventlog#recovery_event

dt avatar Jun 04 '24 19:06 dt

It might be a better UX to have a single Job event that captures job statuses, changes, and all its metadata. This single event would also have job type specific information in another field (e.g., specific running statuses, backup_type, changefeed sink_uri, etc.).

We'd avoid having to join across multiple logs in our internal tool (which doesn't have great join capabilities yet) to understand what job was running at a given time, its job type, and then jobs type-specific information for further troubleshooting.

Thoughts?

Also,

we already have structured logs specific to backups that record details of the backup here: https://www.cockroachlabs.com/docs/stable/eventlog#recovery_event

One thing to call out is that the recovery event was introduced for internal product analytics via the TELEMETRY channel. We haven't vetted or determined whether we want end users to consume this log - in case we were targeting an external customer here. FYI for Alicia when she's back.

kevin-v-ngo avatar Jun 05 '24 01:06 kevin-v-ngo

Since I'm taking over ownership of this work, I put up a new PR, https://github.com/cockroachdb/cockroach/pull/125319.

@kevin-v-ngo @alicia-l2. Can we continue any conversations there?

@msbutler I've addressed some of the comments you left in my new PR, as well as removed some of the fields that existed in the JobStateChange struct.

@abarganier I think you can close this PR now.

kyle-a-wong avatar Jun 10 '24 16:06 kyle-a-wong

@kyle-a-wong I guess you meant @abarganier instead.

alex-berger avatar Jun 10 '24 17:06 alex-berger

Thanks @kyle-a-wong - closing.

abarganier avatar Jun 10 '24 17:06 abarganier