
[ResponseOps] Errors during marking tasks as running are not shown in metrics

Open doakalexi opened this issue 1 year ago • 4 comments

Resolves https://github.com/elastic/kibana/issues/184171

Summary

Errors are not shown in metrics when Elasticsearch returns an error during the markAsRunning operation (which changes a task's status from claiming to running) in TaskManager. This PR updates TaskManager to throw the error instead of only logging it.
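To illustrate why throwing matters here, the following is a minimal sketch, not the actual TaskManager code. Every name in it (`Metrics`, `failingUpdate`, `markAsRunningSwallow`, `markAsRunningThrow`, `pollOnce`) is hypothetical and exists only for this example. It assumes metrics are derived from whether the calling code observes a rejection.

```typescript
// Hypothetical stand-ins; none of these names come from the Kibana codebase.
type Metrics = { claimSuccess: number; claimFailure: number };

// Simulates Elasticsearch failing while a task moves from claiming to running.
async function failingUpdate(): Promise<void> {
  throw new Error('es unavailable');
}

// Log-only behavior: the error is swallowed, so the caller sees a success.
async function markAsRunningSwallow(log: string[]): Promise<void> {
  try {
    await failingUpdate();
  } catch (e) {
    log.push(`Failed to mark task as running: ${(e as Error).message}`);
  }
}

// Throwing behavior: the error propagates, so the caller can count it.
async function markAsRunningThrow(): Promise<void> {
  await failingUpdate();
}

// One poll cycle that records the outcome of `mark` in the metrics object.
async function pollOnce(mark: () => Promise<void>, metrics: Metrics): Promise<void> {
  try {
    await mark();
    metrics.claimSuccess++;
  } catch {
    metrics.claimFailure++;
  }
}
```

With the log-only variant, `pollOnce` increments `claimSuccess` even though the update failed. With the throwing variant, it increments `claimFailure`, which is the kind of signal a metrics endpoint can report.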

Checklist

To verify

  1. Create an Always Firing rule.
  2. Add the code below to the try block of the TaskStore.bulkUpdate method to mimic a markAsRunning failure:
      const isMarkAsRunning = docs.some(
        (doc) =>
          doc.taskType === 'alerting:example.always-firing' &&
          doc.status === 'running' &&
          doc.retryAt !== null
      );
      if (isMarkAsRunning) {
        throw SavedObjectsErrorHelpers.decorateEsUnavailableError(new Error('test'));
      }
  3. Verify that when the above error is thrown, it is reflected in the metrics endpoint results.

doakalexi avatar Aug 26 '24 17:08 doakalexi

/ci

doakalexi avatar Aug 26 '24 17:08 doakalexi

/ci

doakalexi avatar Aug 26 '24 17:08 doakalexi

Pinging @elastic/response-ops (Team:ResponseOps)

elasticmachine avatar Aug 26 '24 21:08 elasticmachine

@mikecote helped me with the testing and I wanted to share what we did in case @ymao1 or @ersin-erdal want to verify.

diff --git a/x-pack/plugins/task_manager/server/task_pool/task_pool.ts b/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
index 217b03135f5..2d2028e8e6e 100644
--- a/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
+++ b/x-pack/plugins/task_manager/server/task_pool/task_pool.ts
@@ -137,7 +137,9 @@ export class TaskPool {
       availableCapacity
     );

+    let counter = 0;
     if (tasksToRun.length) {
+      console.log(`*** Mark as running ${tasksToRun.length} task(s)`);
       await Promise.all(
         tasksToRun
           .filter(
@@ -147,6 +149,11 @@ export class TaskPool {
               )
           )
           .map(async (taskRunner) => {
+            if (counter++ % 2 !== 0) {
+              console.log(`*** Going to fail markTaskAsRunning() for ${taskRunner.id}`);
+              throw new Error('oops');
+            }
+            console.log(`*** Going to succeed markTaskAsRunning() for ${taskRunner.id}`);
             // We use taskRunner.taskExecutionId instead of taskRunner.id as key for the task pool map because
             // task cancellation is a non-blocking procedure. We calculate the expiration and immediately remove
             // the task from the task pool. There is a race condition that can occur when a recurring tasks's schedule
diff --git a/x-pack/plugins/task_manager/server/task_running/task_runner.ts b/x-pack/plugins/task_manager/server/task_running/task_runner.ts
index bfcabed9f6e..12812288f5c 100644
--- a/x-pack/plugins/task_manager/server/task_running/task_runner.ts
+++ b/x-pack/plugins/task_manager/server/task_running/task_runner.ts
@@ -372,6 +372,7 @@ export class TaskManagerRunner implements TaskRunner {
         description: 'run task',
       };

+      console.log(`*** Running task ${this.id}`);
       const result = await this.executionContext.withContext(ctx, () =>
         withSpan({ name: 'run', type: 'task manager' }, () => this.task!.run())
       );
       

The output should look something like this:

*** Mark as running 5 task(s)
*** Going to succeed markTaskAsRunning() for endpoint:complete-external-response-actions-1.0.0
*** Going to fail markTaskAsRunning() for apm-source-map-migration-task-id
*** Going to succeed markTaskAsRunning() for Actions-actions_telemetry
*** Going to fail markTaskAsRunning() for Dashboard-dashboard_telemetry
*** Going to succeed markTaskAsRunning() for observabilityAIAssistant:indexQueuedDocumentsTask
[2024-09-03T13:29:24.674-04:00][ERROR][plugins.taskManager] Failed to poll for work: Error: oops
*** Running task endpoint:complete-external-response-actions-1.0.0
*** Running task Actions-actions_telemetry
*** Running task observabilityAIAssistant:indexQueuedDocumentsTask
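One plausible reading of this output is that `Promise.all` rejects on the first thrown error (hence the "Failed to poll for work" line) while the callbacks that did not throw keep going and their tasks still run. The sketch below reproduces that pattern in isolation. `simulatePoll` and the task ids are hypothetical, and the `counter++ % 2` check mirrors the one in the diff above.

```typescript
// Hypothetical simulation of one poll cycle; not TaskPool code.
async function simulatePoll(
  ids: string[]
): Promise<{ ran: string[]; pollError: string | undefined }> {
  const ran: string[] = [];
  let counter = 0;
  let pollError: string | undefined;

  try {
    await Promise.all(
      ids.map(async (id) => {
        // Fail every second task, as in the test diff.
        if (counter++ % 2 !== 0) {
          throw new Error('oops');
        }
        // Stand-in for the asynchronous "mark as running" step.
        await Promise.resolve();
        ran.push(id);
      })
    );
  } catch (e) {
    // Promise.all rejects with the first rejection only.
    pollError = (e as Error).message;
  }

  // Promise.all does not cancel the other callbacks; give them a turn to finish.
  await new Promise((resolve) => setTimeout(resolve, 0));
  return { ran, pollError };
}
```

Calling `simulatePoll(['a', 'b', 'c', 'd', 'e'])` yields a single poll error while tasks `a`, `c`, and `e` still run, matching the shape of the log above: one error line and three "Running task" lines for five claimed tasks.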

doakalexi avatar Sep 03 '24 17:09 doakalexi

:yellow_heart: Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #14 / EditableMarkdown Save button click calls onSaveContent and onChangeEditable when text area value changed

Metrics [docs]

✅ unchanged

History

  • :yellow_heart: Build #231713 was flaky efd79a2b10f31f9e2a47e9433b2664ad13153924
  • :green_heart: Build #230934 succeeded a226f3e936c7d785f1e244062460a43281e6741c
  • :yellow_heart: Build #229946 was flaky de5a8110c878432d1d447b2edd03cc355710019c
  • :green_heart: Build #229878 succeeded 8212d4be24ce6d3cd1cafdfb41cd47820d04f696

To update your PR or re-run it, just comment with: @elasticmachine merge upstream

kibana-ci avatar Sep 04 '24 16:09 kibana-ci