return-dispatch icon indicating copy to clipboard operation
return-dispatch copied to clipboard

Exhausted searching IDs in known runs

Open jduan-highnote opened this issue 1 year ago • 46 comments

Hi,

From time to time, we would see this error "Exhausted searching IDs in known runs". See the logs below. I bet this is caused by github API not finding the relevant run. That said, is there anything we can do to mitigate this problem? Thanks!

Mon, 07 Aug 2023 16:14:00 GMT Fetching Workflow ID for smokes-cron.yml...
Mon, 07 Aug 2023 16:14:00 GMT Fetched Workflow ID: 7198060
Mon, 07 Aug 2023 16:14:01 GMT Successfully dispatched workflow:
Mon, 07 Aug 2023 16:14:01 GMT   Repository: bay1inc/api-platform
Mon, 07 Aug 2023 16:14:01 GMT   Branch: refs/heads/main
Mon, 07 Aug 2023 16:14:01 GMT   Workflow ID: smokes-cron.yml
Mon, 07 Aug 2023 16:14:01 GMT   Workflow Inputs: {"envs":"staging-test,staging-live"}
Mon, 07 Aug 2023 16:14:01 GMT   Distinct ID: 29fc4ecb-c9d5-44d8-9216-aa22fc11ff89
Mon, 07 Aug 2023 16:14:01 GMT Attempt to extract run ID from steps...
Mon, 07 Aug 2023 16:14:03 GMT Exhausted searching IDs in known runs, attempt 1...
Mon, 07 Aug 2023 16:14:11 GMT Exhausted searching IDs in known runs, attempt 2...
Mon, 07 Aug 2023 16:14:18 GMT Exhausted searching IDs in known runs, attempt 3...
Mon, 07 Aug 2023 16:14:26 GMT Exhausted searching IDs in known runs, attempt 4...
Mon, 07 Aug 2023 16:14:33 GMT Exhausted searching IDs in known runs, attempt 5...
Mon, 07 Aug 2023 16:14:41 GMT Exhausted searching IDs in known runs, attempt 6...
Mon, 07 Aug 2023 16:14:48 GMT Exhausted searching IDs in known runs, attempt 7...
Mon, 07 Aug 2023 16:14:56 GMT Exhausted searching IDs in known runs, attempt 8...
Mon, 07 Aug 2023 16:15:03 GMT Exhausted searching IDs in known runs, attempt 9...
Mon, 07 Aug 2023 16:15:11 GMT Exhausted searching IDs in known runs, attempt 10...
Mon, 07 Aug 2023 16:15:18 GMT Exhausted searching IDs in known runs, attempt 11...
Mon, 07 Aug 2023 16:15:26 GMT Exhausted searching IDs in known runs, attempt 12...
Mon, 07 Aug 2023 16:15:33 GMT Exhausted searching IDs in known runs, attempt 13...
Mon, 07 Aug 2023 16:15:41 GMT Exhausted searching IDs in known runs, attempt 14...
Mon, 07 Aug 2023 16:15:48 GMT Exhausted searching IDs in known runs, attempt 15...
Mon, 07 Aug 2023 16:15:55 GMT Exhausted searching IDs in known runs, attempt 16...
Mon, 07 Aug 2023 16:16:04 GMT Exhausted searching IDs in known runs, attempt 17...
Mon, 07 Aug 2023 16:16:12 GMT Exhausted searching IDs in known runs, attempt 18...
Mon, 07 Aug 2023 16:16:20 GMT Exhausted searching IDs in known runs, attempt 19...
Mon, 07 Aug 2023 16:16:27 GMT Exhausted searching IDs in known runs, attempt 20...
Mon, 07 Aug 2023 16:16:35 GMT Exhausted searching IDs in known runs, attempt 21...
Mon, 07 Aug 2023 16:16:42 GMT Exhausted searching IDs in known runs, attempt 22...
Mon, 07 Aug 2023 16:16:55 GMT Exhausted searching IDs in known runs, attempt 23...
Mon, 07 Aug 2023 16:16:57 GMT Exhausted searching IDs in known runs, attempt 24...
Mon, 07 Aug 2023 16:17:05 GMT Exhausted searching IDs in known runs, attempt 25...
Mon, 07 Aug 2023 16:17:12 GMT Exhausted searching IDs in known runs, attempt 26...
Mon, 07 Aug 2023 16:17:20 GMT Exhausted searching IDs in known runs, attempt 27...
Mon, 07 Aug 2023 16:17:27 GMT Exhausted searching IDs in known runs, attempt 28...
Mon, 07 Aug 2023 16:17:35 GMT Exhausted searching IDs in known runs, attempt 29...
Mon, 07 Aug 2023 16:17:42 GMT Exhausted searching IDs in known runs, attempt 30...
Mon, 07 Aug 2023 16:17:50 GMT Exhausted searching IDs in known runs, attempt 31...
Mon, 07 Aug 2023 16:17:57 GMT Exhausted searching IDs in known runs, attempt 32...
Mon, 07 Aug 2023 16:18:05 GMT Exhausted searching IDs in known runs, attempt 33...
Mon, 07 Aug 2023 16:18:13 GMT Exhausted searching IDs in known runs, attempt 34...
Mon, 07 Aug 2023 16:18:20 GMT Exhausted searching IDs in known runs, attempt 35...
Mon, 07 Aug 2023 16:18:28 GMT Exhausted searching IDs in known runs, attempt 36...
Mon, 07 Aug 2023 16:18:35 GMT Exhausted searching IDs in known runs, attempt 37...
Mon, 07 Aug 2023 16:18:43 GMT Exhausted searching IDs in known runs, attempt 38...
Mon, 07 Aug 2023 16:18:50 GMT Exhausted searching IDs in known runs, attempt 39...
Mon, 07 Aug 2023 16:19:05 GMT Exhausted searching IDs in known runs, attempt 40...
Mon, 07 Aug 2023 16:19:07 GMT Exhausted searching IDs in known runs, attempt 41...
Mon, 07 Aug 2023 16:19:12 GMT Error: Failed to complete: Timeout exceeded while attempting to get Run ID
Mon, 07 Aug 2023 16:19:12 GMT Warning: Does the token have the correct permissions?
Mon, 07 Aug 2023 16:19:12 GMT Error: Timeout exceeded while attempting to get Run ID

jduan-highnote avatar Aug 07 '23 16:08 jduan-highnote

Apologies, I missed this issue being raised

Yes typically this would indicate that the run fails to start, do you find that the run is starting and this action is failing to get the ID?

Mitigate in which sense? Are you using this action to simply kick off another action and are not concerned for the ID? Or?

Codex- avatar Sep 04 '23 05:09 Codex-

What I see often is that a run was successfully triggered (the "Distinct ID" matches the expected ID) but this action couldn't find it after many tries.

jduan-highnote avatar Sep 04 '23 17:09 jduan-highnote

Hmmm I wonder if the rework i'm doing for v2 will help. In the interim, can you please post the full log with debug logging enabled?

You can enable this by adding a secret ACTIONS_RUNNER_DEBUG with true (info on this here https://docs.github.com/en/actions/monitoring-and-troubleshooting-workflows/enabling-debug-logging)

Codex- avatar Sep 05 '23 00:09 Codex-

Good idea. I just added that secret. Will keep you posted. Thanks!

jduan-highnote avatar Sep 05 '23 18:09 jduan-highnote

Happy to re-open this should there still be any issues, let me know :)

Codex- avatar Sep 12 '23 23:09 Codex-

So this happened today.

It failed to find a workflow run with the matching dispatch id:

ISMM-2399__wrap_event_sourcing_in_transaction_context___9843__·_bay1inc_bayone_efaaa1c

However, the workflow run was dispatched successfully (the id matches)

Recurring_API_smoke_tests_·_bay1inc_api-platform_e8efc6f

jduan-highnote avatar Sep 14 '23 04:09 jduan-highnote

How long is it between the job starting and the action dispatching? do you use custom runners or gh runners?

Codex- avatar Sep 14 '23 06:09 Codex-

The action was dispatched at Thu, 14 Sep 2023 04:44:27 GMT and the job started running at Thu, 14 Sep 2023 04:44:35 GMT so it took 8 seconds.

We use self-hosted runners for this job. We used to use self-hosted runners and had this problem as well.

jduan-highnote avatar Sep 14 '23 15:09 jduan-highnote

Weird, does the output for return-dispatch to catch happen before anything else?

Codex- avatar Sep 19 '23 02:09 Codex-

Sorry what you mean exactly?

jduan-highnote avatar Sep 19 '23 04:09 jduan-highnote

Ah sorry, didn't explain myself very well, does this step happen before any other steps?

      - name: echo distinct ID ${{ github.event.inputs.distinct_id }}
        run: echo ${{ github.event.inputs.distinct_id }}

Codex- avatar Sep 19 '23 04:09 Codex-

No, that job only has one step like this:

  id:
    runs-on: [infra, us-central1, build-small]
    if: ${{ github.event.inputs.distinct_id != null }}
    steps:
      - name: Distinct ID ${{ github.event.inputs.distinct_id }}
        run: echo ${{ github.event.inputs.distinct_id }}

jduan-highnote avatar Sep 19 '23 04:09 jduan-highnote

@jduan-highnote Could it be because your targeted repository triggers a lot of workflows on regular basis? I see that the API call to get the workflow runs only fetches 5 when a branch name is specified as ref, and only 10 otherwise:

https://github.com/Codex-/return-dispatch/blob/78fb303cfd7699e9c743c2c76479906f0e970b80/src/api.ts#L135-L147

If more than 5 workflows have been triggered at the same time, then your run might never be in the first page of results and therefore impossible to find. It would be great to increase the per_page value or make it configurable.

pmrotule avatar Nov 03 '23 14:11 pmrotule

@pmrotule that's great insight. It's certainly possible that the target workflow has more than 5 concurrent runs at some point. It'd be great to make this a configurable param! @Codex-

jduan-highnote avatar Nov 03 '23 15:11 jduan-highnote

@jduan-highnote I've cut a new release with some enhancements courtesy of @pmrotule

Let me know how it goes (v1 tag has been updated so if you're using the v1 tag you shouldn't have to change anything)

Codex- avatar Nov 23 '23 21:11 Codex-

@Codex- can you given a summary of the recent enhancements? We do use v1 across the board so we will pick up the enhancements automatically.

jduan-highnote avatar Nov 25 '23 23:11 jduan-highnote

An octokit iterator is now being used to check for runs across multiple API response pages

Codex- avatar Nov 26 '23 05:11 Codex-

Nice! I'll keep you posted.

jduan-highnote avatar Nov 27 '23 16:11 jduan-highnote

An octokit iterator is now being used to check for runs across multiple API response pages

@Codex- The iterator is only used to get the workflow id within getWorkflowId, but getWorkflowRunIds is still only looking at the first 5 runs:

https://github.com/Codex-/return-dispatch/blob/dbbbe1d543b3d42aee5613cd7eb4dea947dc54a5/src/api.ts#L148-L160

So I think that my assumption from last month is still valid:

If more than 5 workflows have been triggered at the same time, then your run might never be in the first page of results and therefore impossible to find.

A nice way to solve this would be to use an iterator within getWorkflowRunIds as well, but with a limit since going through all workflow runs in the history of a repository could take a very long time. So something like this:

const workflowRunIterator = octokit.paginate.iterator(
  octokit.rest.actions.listWorkflowRuns,
  {
    owner: config.owner,
    repo: config.repo,
    workflow_id: workflowId,
    per_page: 20,
  },
);
let runIds: number[] = [];
let page = 1;
const maxPage = 5;

for await (const response of workflowRunIterator) {
  if (response.status !== 200) {
    throw new Error(
      `Failed to get Workflow runs, expected 200 but received ${response.status}`,
    );
  }
  // wrong type definition
  const runs: typeof response.data.workflow_runs = response.data;
  runIds = runs.map((workflowRun) => workflowRun.id);

  if (runIds.length || page === maxPage) {
    break;
  }
  page++;
}

pmrotule avatar Dec 01 '23 11:12 pmrotule

I am also running into this issue:

Successfully dispatched workflow: Repository: *** Branch: *** Workflow ID: ***.yml Distinct ID: 348a1190-04f9-4c90-b5e5-ef3608dd4265 Attempt to extract run ID from steps... Exhausted searching IDs in known runs, attempt 1... Exhausted searching IDs in known runs, attempt 2... Exhausted searching IDs in known runs, attempt 3... Exhausted searching IDs in known runs, attempt 4... Exhausted searching IDs in known runs, attempt 5... Exhausted searching IDs in known runs, attempt 6... Exhausted searching IDs in known runs, attempt 7... Exhausted searching IDs in known runs, attempt 8... Exhausted searching IDs in known runs, attempt 9... Exhausted searching IDs in known runs, attempt 10... Error: getWorkflowRunJobs: An unexpected error has occurred: Server Error Error: Failed to complete: Server Error Warning: Does the token have the correct permissions? Error: Server Error

The weirdest thing is that it will keep attempting to find the ID while the successfully triggered pipeline is running and then hits the Server Error or timeout error once the pipeline is done. If there's anything I can add to help with troubleshooting, please let me know. Thank you for working on this, this functionality should be included in Github Actions by default.

ChrisPage-AT avatar Apr 02 '24 20:04 ChrisPage-AT

@ChrisPage-AT When the search is being exhausted, if you do an API request yourself, is the run present?

Thank you for working on this, this functionality should be included in Github Actions by default.

Totally agree, really hoping we see it one day

Codex- avatar Apr 02 '24 21:04 Codex-

@Codex- Using the api calls listed in your readme, I can find the triggered pipeline with List Workflow Runs and use the id to see the jobs with List Jobs for a workflow run. However I am noticing that neither of those calls seem to list the distinct_id that I'm passing in. Are you doing a different call to obtain that?

ChrisPage-AT avatar Apr 03 '24 13:04 ChrisPage-AT

I started seeing this error recently. You can see that the workflow was dispatched successfully and the ID matches.

Cursor_and_Dataflow_Release_·_bay1inc_bayone_f91bb88 Cursor_and_Deploy_Dataflow_Job_·_bay1inc_release_6b29f96

jduan-highnote avatar Apr 05 '24 16:04 jduan-highnote

Server error? Unusual, can you enable debug logging? https://docs.github.com/en/actions/monitoring-and-troubleshooting-workflows/enabling-debug-logging

Can hopefully get some more info there

Codex- avatar Apr 06 '24 11:04 Codex-

Interesting. I wasn't aware of enabling debug logs for runners. Should I enable both runner & step debug logs?

jduan-highnote avatar Apr 06 '24 21:04 jduan-highnote

I think step but I'd do both to get a full picture

Codex- avatar Apr 07 '24 06:04 Codex-

I enabled debug log a few weeks ago and today the problem happened for the first time. Please see the attached logs.

Thanks for that @jduan-highnote , I'll have a dig when I get a chance. Really odd that the ID being returned is so short, almost looks truncated.

I've deleted your post so as to not reveal anything about your infra

Codex- avatar Apr 25 '24 21:04 Codex-

Thank you for being so thoughtful!

jduan-highnote avatar Apr 26 '24 00:04 jduan-highnote

Here's another occurrence if it helps:

Fri, 26 Apr 2024 19:25:45 GMT
Run codex-/return-dispatch@v1
Fri, 26 Apr 2024 19:25:45 GMT Fetching Workflow ID for prbox_update_routing.yaml...
Fri, 26 Apr 2024 19:25:45 GMT Fetched Workflow ID: 88382506
Fri, 26 Apr 2024 19:25:46 GMT Successfully dispatched workflow:
Fri, 26 Apr 2024 19:25:46 GMT   Repository: bay1inc/devops
Fri, 26 Apr 2024 19:25:46 GMT   Branch: main
Fri, 26 Apr 2024 19:25:46 GMT   Workflow ID: prbox_update_routing.yaml
Fri, 26 Apr 2024 19:25:46 GMT   Workflow Inputs: {"services":"non-pci","tenants":"test"}
Fri, 26 Apr 2024 19:25:46 GMT   Distinct ID: f9c198f0-84b8-4943-a7cf-31be5bcec42b
Fri, 26 Apr 2024 19:25:46 GMT Attempt to extract run ID from steps...
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]Attempting to fetch Run IDs for Workflow ID 88382506
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]failed to get branch for ref: main, please raise an issue with this git ref.
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]Fetched Workflow Runs:
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Repository: bay1inc/devops
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Branch: undefined
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Workflow ID: 88382506
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Runs Fetched: [8852916369,8852872434,8852351822,8852314544,8852302751,8852270907,8851776517,8851726747,8851694157,8851681964]
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]Attempting to get step names for Run IDs: [8852916369,8852872434,8852351822,8852314544,8852302751,8852270907,8851776517,8851726747,8851694157,8851681964]
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]Fetched Workflow Run Job Steps:
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Repository: bay1inc/devops
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Workflow Run ID: 8852916369
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Jobs Fetched: [24312583511,24312583707,24312583931,24312591506,24312619999,24312642954]  Steps Fetched: [Set up job,Print Input Values,Clone Devops,Generate Matrix JSON,Post Clone Devops,Complete job,Distinct ID 61897684-539d-4539-b241-54ec6159841e,Set up runner,Setup K8s auth,Deploying Apps,Complete runner]
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]Fetched Workflow Run Job Steps:
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Repository: bay1inc/devops
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Workflow Run ID: 8852872434
Fri, 26 Apr 2024 19:25:46 GMT ##[debug]  Jobs Fetched: [24312437185,24312437385,24312437629,24312444058,24312464693,24312481469]  Steps Fetched: [Set up job,Print Input Values,Clone Devops,Generate Matrix JSON,Post Clone Devops,Complete job,Distinct ID fcc9476c-2cae-43a1-8850-ae2df9728507,Set up runner,Setup K8s auth,Deploying Apps,Complete runner]
Fri, 26 Apr 2024 19:25:47 GMT ##[debug]Fetched Workflow Run Job Steps:
Fri, 26 Apr 2024 19:25:47 GMT ##[debug]  Repository: bay1inc/devops
Fri, 26 Apr 2024 19:25:47 GMT ##[debug]  Workflow Run ID: 8852351822
Fri, 26 Apr 2024 19:25:47 GMT ##[debug]  Jobs Fetched: [24310882803,24310883124,24310883576,24310892691,24310930047,24310957296]  Steps Fetched: [Set up job,Print Input Values,Clone Devops,Generate Matrix JSON,Post Clone Devops,Complete job,Distinct ID 0fec8571-ebc2-49f5-9364-747a1b6d3649,Set up runner,Setup K8s auth,Deploying Apps,Complete runner]
Fri, 26 Apr 2024 19:25:47 GMT ##[debug]Fetched Workflow Run Job Steps:
Fri, 26 Apr 2024 19:25:47 GMT ##[debug]  Repository: bay1inc/devops
Fri, 26 Apr 2024 19:25:47 GMT ##[debug]  Workflow Run ID: 8852314544
Fri, 26 Apr 2024 19:25:47 GMT ##[debug]  Jobs Fetched: [24310757841,24310757992,24310758179,24310764698,24310791043,24310814079]  Steps Fetched: [Set up job,Print Input Values,Clone Devops,Generate Matrix JSON,Post Clone Devops,Complete job,Distinct ID 6822826a-513e-4f60-8535-fe7573de521e,Set up runner,Setup K8s auth,Deploying Apps,Complete runner]
Fri, 26 Apr 2024 19:25:57 GMT Error: getWorkflowRunJobs: An unexpected error has occurred: Server Error
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]HttpError: Server Error
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]    at file:///runner/_work/_actions/codex-/return-dispatch/v1/dist/index.mjs:19832:26
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]    at async getWorkflowRunJobSteps (file:///runner/_work/_actions/codex-/return-dispatch/v1/dist/index.mjs:23186:22)
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]    at async run (file:///runner/_work/_actions/codex-/return-dispatch/v1/dist/index.mjs:23269:25)
Fri, 26 Apr 2024 19:25:57 GMT Error: Failed to complete: Server Error
Fri, 26 Apr 2024 19:25:57 GMT Warning: Does the token have the correct permissions?
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]HttpError: Server Error
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]    at file:///runner/_work/_actions/codex-/return-dispatch/v1/dist/index.mjs:19832:26
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]    at async getWorkflowRunJobSteps (file:///runner/_work/_actions/codex-/return-dispatch/v1/dist/index.mjs:23186:22)
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]    at async run (file:///runner/_work/_actions/codex-/return-dispatch/v1/dist/index.mjs:23269:25)
Fri, 26 Apr 2024 19:25:57 GMT Error: Server Error
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]Node Action run completed with exit code 1
Fri, 26 Apr 2024 19:25:57 GMT ##[debug]Finishing: Trigger routing updates

jduan-highnote avatar Apr 26 '24 20:04 jduan-highnote

@Codex- do you have enough logs for debugging? I'd like to turn off debug logs for our github actions.

jduan-highnote avatar May 03 '24 21:05 jduan-highnote