[ResponseOps] Errors during marking tasks as running are not shown in metrics
Resolves https://github.com/elastic/kibana/issues/184171
Summary
When Elasticsearch returns an error during the markAsRunning operation in TaskManager (the step that changes a task's status from claiming to running), the error is not reflected in metrics. This PR updates TaskManager to throw the error instead of only logging it.
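To illustrate why swallowing the error hides it from metrics, here is a minimal sketch. It does not use the real Task Manager code: `ErrorCounter`, `failingMarkAsRunning`, `markAndSwallow`, `markAndRethrow`, and `pollOnce` are all hypothetical names, and the assumption is that metrics are driven by errors the caller can observe.

```typescript
// Hypothetical stand-ins; none of these names come from the Task Manager code base.
type Logger = { error: (msg: string) => void };

// Assumed stand-in for whatever the metrics layer increments on failure.
class ErrorCounter {
  public count = 0;
  public record(): void {
    this.count += 1;
  }
}

// Simulates Elasticsearch failing during the claiming -> running update.
async function failingMarkAsRunning(): Promise<void> {
  throw new Error('es unavailable');
}

// "Before" behavior: the error is logged and swallowed, so the caller never sees it.
async function markAndSwallow(logger: Logger): Promise<void> {
  try {
    await failingMarkAsRunning();
  } catch (e) {
    logger.error(`Failed to mark task as running: ${(e as Error).message}`);
  }
}

// "After" behavior: the error is logged and rethrown, so the caller can count it.
async function markAndRethrow(logger: Logger): Promise<void> {
  try {
    await failingMarkAsRunning();
  } catch (e) {
    logger.error(`Failed to mark task as running: ${(e as Error).message}`);
    throw e;
  }
}

// A caller that records every error it observes.
async function pollOnce(
  mark: (logger: Logger) => Promise<void>,
  logger: Logger,
  counter: ErrorCounter
): Promise<void> {
  try {
    await mark(logger);
  } catch {
    counter.record();
  }
}
```

With `markAndSwallow` the counter stays at zero even though the operation failed; with `markAndRethrow` the failure reaches the caller and is counted.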
Checklist
- [ ] Unit or functional tests were updated or added to match the most common scenarios
To verify
- Create an Always Firing rule.
- Put the code below in the try block of the TaskStore.bulkUpdate method to mimic a markAsRunning failure:
    const isMarkAsRunning = docs.some(
      (doc) =>
        doc.taskType === 'alerting:example.always-firing' &&
        doc.status === 'running' &&
        doc.retryAt !== null
    );
    if (isMarkAsRunning) {
      throw SavedObjectsErrorHelpers.decorateEsUnavailableError(new Error('test'));
    }
- Verify that when the above error is thrown, it is reflected in the metrics endpoint results.
/ci
Pinging @elastic/response-ops (Team:ResponseOps)
@mikecote helped me with the testing and I wanted to share what we did in case @ymao1 or @ersin-erdal want to verify.
diff --git a/x-pack/plugins/task_manager/server/task_pool/task_pool.ts b/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
index 217b03135f5..2d2028e8e6e 100644
--- a/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
+++ b/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
@@ -137,7 +137,9 @@ export class TaskPool {
availableCapacity
);
+ let counter = 0;
if (tasksToRun.length) {
+ console.log(`*** Mark as running ${tasksToRun.length} task(s)`);
await Promise.all(
tasksToRun
.filter(
@@ -147,6 +149,11 @@ export class TaskPool {
)
)
.map(async (taskRunner) => {
+ if (counter++ % 2 !== 0) {
+ console.log(`*** Going to fail markTaskAsRunning() for ${taskRunner.id}`);
+ throw new Error('oops');
+ }
+ console.log(`*** Going to succeed markTaskAsRunning() for ${taskRunner.id}`);
// We use taskRunner.taskExecutionId instead of taskRunner.id as key for the task pool map because
// task cancellation is a non-blocking procedure. We calculate the expiration and immediately remove
// the task from the task pool. There is a race condition that can occur when a recurring tasks's schedule
diff --git a/x-pack/plugins/task_manager/server/task_running/task_runner.ts b/x-pack/plugins/task_manager/server/task_running/task_runner.ts
index bfcabed9f6e..12812288f5c 100644
--- a/x-pack/plugins/task_manager/server/task_running/task_runner.ts
+++ b/x-pack/plugins/task_manager/server/task_running/task_runner.ts
@@ -372,6 +372,7 @@ export class TaskManagerRunner implements TaskRunner {
description: 'run task',
};
+ console.log(`*** Running task ${this.id}`);
const result = await this.executionContext.withContext(ctx, () =>
withSpan({ name: 'run', type: 'task manager' }, () => this.task!.run())
);
The output should look something like this:
*** Mark as running 5 task(s)
*** Going to succeed markTaskAsRunning() for endpoint:complete-external-response-actions-1.0.0
*** Going to fail markTaskAsRunning() for apm-source-map-migration-task-id
*** Going to succeed markTaskAsRunning() for Actions-actions_telemetry
*** Going to fail markTaskAsRunning() for Dashboard-dashboard_telemetry
*** Going to succeed markTaskAsRunning() for observabilityAIAssistant:indexQueuedDocumentsTask
[2024-09-03T13:29:24.674-04:00][ERROR][plugins.taskManager] Failed to poll for work: Error: oops
*** Running task endpoint:complete-external-response-actions-1.0.0
*** Running task Actions-actions_telemetry
*** Running task observabilityAIAssistant:indexQueuedDocumentsTask
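The log pattern above (a single "Failed to poll for work" error, followed by the tasks that were marked successfully still running) is consistent with how `Promise.all` behaves: it rejects with the first rejection, while the other callbacks keep going. The sketch below reproduces that pattern in isolation; `runAll` is a hypothetical helper, not part of TaskPool, and the alternating failure mirrors the `counter` in the debug diff.

```typescript
// Hypothetical helper: fails every second callback, like the counter in the debug diff.
async function runAll(ids: string[]): Promise<{ started: string[]; error?: string }> {
  const started: string[] = [];
  let counter = 0;
  try {
    await Promise.all(
      ids.map(async (id) => {
        if (counter++ % 2 !== 0) {
          throw new Error('oops');
        }
        // Yield once to mimic the asynchronous markTaskAsRunning() call.
        await Promise.resolve();
        // Stands in for the task actually starting to run.
        started.push(id);
      })
    );
    return { started };
  } catch (e) {
    // Promise.all surfaces only the first rejection; wait a tick so the
    // remaining callbacks have settled before reporting what started.
    await new Promise((resolve) => setTimeout(resolve, 0));
    return { started, error: (e as Error).message };
  }
}
```

Calling `runAll(['a', 'b', 'c', 'd', 'e'])` reports one error even though two callbacks failed, and the three callbacks that did not fail still complete.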
:yellow_heart: Build succeeded, but was flaky
- Buildkite Build
- Commit: 58b2bb5561956c6e22cfdcaf8994819ddd3689c3
Failed CI Steps
Test Failures
- [job] [logs] Jest Tests #14 / EditableMarkdown Save button click calls onSaveContent and onChangeEditable when text area value changed
Metrics [docs]
✅ unchanged
History
- :yellow_heart: Build #231713 was flaky efd79a2b10f31f9e2a47e9433b2664ad13153924
- :green_heart: Build #230934 succeeded a226f3e936c7d785f1e244062460a43281e6741c
- :yellow_heart: Build #229946 was flaky de5a8110c878432d1d447b2edd03cc355710019c
- :green_heart: Build #229878 succeeded 8212d4be24ce6d3cd1cafdfb41cd47820d04f696