
Webhook for deployments

Open jayantbh opened this issue 1 year ago • 14 comments

Why do we need this?

Currently PR merges are assumed to be deployments for a repo, which is fair for any repo that runs on some kind of CI.

But for the many repos where that isn't the case, we should at least support a webhook-based mechanism that lets me feed my deployment/workflow run data into Middleware.

That would give me a better picture of my DORA metrics, with more accurate Lead Time and Deployment Frequency.

jayantbh avatar Oct 01 '24 19:10 jayantbh

@jayantbh working on this.

Kamlesh72 avatar Oct 02 '24 11:10 Kamlesh72

Sure. Do share your approach before you begin implementation.

jayantbh avatar Oct 02 '24 11:10 jayantbh

[!IMPORTANT] This issue is tagged advanced. By taking this up you acknowledge that this will be a non-trivial change and may require thorough testing and review. Of course, this also means we offer swag for anyone who goes out of their way to tackle issues tagged advanced. 🚀 It also means we'll follow up on this regularly, and in case of inactivity the issue will be unassigned.

jayantbh avatar Oct 03 '24 15:10 jayantbh

@jayantbh Currently we take PR merges or workflows (like GitHub Actions) as deployments, correct?

I am thinking of creating a route that collects workflow/deployment webhook data. The captured data will be mapped and pushed into RepoWorkflowRuns, with a separate adapter for each provider (Bitbucket, CircleCI, GitLab, etc.).

This is the basic idea, although more brainstorming is needed.

Kamlesh72 avatar Oct 03 '24 18:10 Kamlesh72

This should ideally happen on the Python backend (apiserver dir). But yes, you have the right idea. I'll let @adnanhashmi09 explain further.

jayantbh avatar Oct 04 '24 17:10 jayantbh

Keep the following in mind while implementing the workflow:

  1. Use authorization headers or custom headers for authenticating the workflow user. We should create a mechanism for users to create and update API keys. This would also include UI development efforts.

  2. The webhook should never cause the workflow to fail or take an excessively long time. It should return a status of 200 in all cases. In case of an error, the response body should contain the error message and possible ways to fix it.

  3. We need a mechanism to map these workflows to repositories linked with Middleware. Therefore, the webhook should also receive repository data for each workflow run.

  4. The processing of data should be asynchronous and not block the API response. The API request should resolve almost immediately after the request has been sent.

  5. The data should be processed in chunks, and the end user should send data in chunks, i.e., no more than 500 workflow runs data in a single call. This webhook should have the ability to sync large amounts of data and/or a single workflow run. Users can make a call to this webhook at the start and end of their workflow. We can infer the duration of the workflow run using that. Another case could be a user sending a number of their older workflow runs for us to process.

  6. A simple validation of the received data should be performed when someone tries to upload data. If the required fields are not present, we should return an error body describing the problem with a status code of 200. We don't keep erroneous data.

  7. We would also need an API to prune the data synced if someone uploaded incorrect data and wanted to delete it.

  8. An API to revoke/generate API tokens is necessary.

  9. A frontend page to manage API tokens should be developed.

  10. Implement alerting/notification in case of erroneous data.

  11. A data dump for the request type, request body, response and error should be saved in case of an error. The data received from the end-user can be saved here and then later picked up for processing. So this could serve multiple purposes.

  12. We need some event-based system to process workflow runs asynchronously without blocking the main thread. So whenever someone sends a request to our webhook we register an "event" which is picked up by a listener. When that event is invoked, the listener queries the database for the latest data to process and starts processing. (A rough endpoint sketch is included at the end of this comment.)

  13. The request body can be as follows:

{
    "workflow_runs":[
        {          
            "workflow_name":"custom_workflow",
            "repo_names":["middleware"],
            "event_actor":"adnanhashmi09",
            "head_branch":"master",
            "workflow_run_unique_id":"unique_item",
            "status":"SUCCESS",
            "duration":"200", // can be provided, or we can infer this
            "workflow_run_conducted_at":"2024-09-28T20:35:45.123456+00:00"
        }
    ]
}

Read through the workflow sync once to check all the fields required for creating a RepoWorkflowRun

  14. A RepoWorkflow shall be created based on workflow_name and repo_names if not already present. This shall also be part of validation, i.e., if a RepoWorkflow cannot be created because the repo_names are wrong or not linked to Middleware, it shall result in an error.

So there are a lot of moving parts in this implementation, and it would require a thorough understanding of our system. Please read through the sync and document your approach here before starting to implement. This is a rather comprehensive task and will take longer to implement.
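
For illustration only, the receiving endpoint could look roughly like the sketch below, assuming a Flask-style handler in the apiserver; the route path and the helper stubs are placeholders, not final names:

from flask import Blueprint, jsonify, request

webhook_api = Blueprint("webhook_api", __name__)

MAX_RUNS_PER_CALL = 500
REQUIRED_FIELDS = {"workflow_name", "repo_names", "workflow_run_unique_id",
                   "status", "workflow_run_conducted_at"}


def verify_api_key(header_value):
    # Hypothetical stub: look the key up (API keys / Integrations table) and return the org id.
    return "org-id" if header_value else None


def register_workflow_event(org_id, payload):
    # Hypothetical stub: save the raw payload as a data-dump row and register an "event"
    # for the async listener (points 11 and 12); returns the stored event's id.
    return "event-id"


@webhook_api.route("/public/webhook/workflows", methods=["POST"])
def receive_workflow_runs():
    # Point 1: could equally be a custom header such as "X-Secret-Key".
    org_id = verify_api_key(request.headers.get("Authorization"))
    if org_id is None:
        # Point 2: always return 200; errors are described in the body.
        return jsonify({"status": "error", "message": "Invalid or missing API key"}), 200

    payload = request.get_json(silent=True) or {}
    runs = payload.get("workflow_runs", [])
    if not runs or len(runs) > MAX_RUNS_PER_CALL:
        # Point 5: cap the amount of data accepted per call.
        return jsonify({"status": "error",
                        "message": f"Send 1 to {MAX_RUNS_PER_CALL} workflow runs per call"}), 200
    if not all(isinstance(run, dict) for run in runs):
        return jsonify({"status": "error", "message": "Each workflow run must be an object"}), 200

    # Point 6: shallow validation of required fields only; erroneous data is not kept.
    missing = sorted({f for run in runs for f in REQUIRED_FIELDS - run.keys()})
    if missing:
        return jsonify({"status": "error",
                        "message": f"Missing required fields: {missing}"}), 200

    # Points 4 and 12: hand off to async processing and resolve immediately.
    event_id = register_workflow_event(org_id, payload)
    return jsonify({"status": "accepted", "event_id": event_id}), 200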

adnanhashmi09 avatar Oct 07 '24 19:10 adnanhashmi09

@adnanhashmi09 Providers like GitHub Actions, GitLab, CircleCI, etc. are the sources of user deployment data. What other platforms can send data to our system? The structure of the data will differ for each source, so will we need an adapter to process data from each, or will the user be sending an already structured response?

We can store all the incoming data in Redis, to be picked up later for processing. We can only verify errors like an invalid API key before sending the response, since the data is not yet processed but we need to return 200 ASAP. So how do we satisfy point 6 if we are not processing the data synchronously?

We also fetch data from the GitHub Actions REST API. So both REST API and webhook data will be stored in the same table, right? Can a user also prune data fetched from the GitHub Actions REST API (point 7)?

Can you please elaborate on point 11?

Kamlesh72 avatar Oct 09 '24 13:10 Kamlesh72

@adnanhashmi09 Providers like GitHub Actions, GitLab, CircleCI, etc. are the sources of user deployment data. What other platforms can send data to our system? The structure of the data will differ for each source, so will we need an adapter to process data from each, or will the user be sending an already structured response?

This webhook implementation is platform agnostic. We don't care about the workflow providers as the provider is not responsible for sending data. It is the user who integrates our webhook into their workflow who is responsible for sending the correct data. We will define a set of fields we require in the request body for us to register RepoWorkflow and RepoWorkflowRuns. It is up to the end user to make sure correct values are being sent.

We can store all the incoming data in Redis, to be picked up later for processing. We can only verify errors like an invalid API key before sending the response, since the data is not yet processed but we need to return 200 ASAP. So how do we satisfy point 6 if we are not processing the data synchronously?

Well, we can check for a few errors besides API key errors: for instance, the maximum amount of data allowed in one request, and whether the repo_names sent are linked with Middleware or not. These operations are fairly quick to compute.

We also fetch data from the GitHub Actions REST API. So both REST API and webhook data will be stored in the same table, right? Can a user also prune data fetched from the GitHub Actions REST API (point 7)?

I don't think anybody would send GitHub Actions data through both the integration and the webhook. But yes, in practice we keep both sets of data. We don't give the option to prune GitHub Actions data, as users can always unlink that integration.

Can you please elaborate on point 11?

We can save the entire request data in a database table including the data we receive for processing. This way we can check for errors and show alerts to the user by getting data from that table. It can also serve as a data dump to check what data has been received by our system for processing.

adnanhashmi09 avatar Oct 10 '24 22:10 adnanhashmi09

API KEYS

  • User will be able to Create, Read and Delete API Keys.
// APIKeys Table Schema in Postgres
API_KEYS {
    name: string,
    secret_key: string,
    scopes: string[], //  [ WORKFLOW, INCIDENT ]
    org_id: string
}
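
For reference, a hedged sketch of how this table might map to a SQLAlchemy model; the class, table and column names here are illustrative, not the repo's actual models:

import uuid
from sqlalchemy import ARRAY, Column, String
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ApiKey(Base):
    __tablename__ = "APIKeys"  # illustrative table name

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    name = Column(String, nullable=False)
    secret_key = Column(String, nullable=False)      # value callers present in "X-Secret-Key"
    scopes = Column(ARRAY(String), nullable=False)   # e.g. ["WORKFLOW", "INCIDENT"]
    org_id = Column(String, nullable=False)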

Receiving Webhook Data

// POST /public/webhook/deployments
// Headers: "X-Secret-Key": "secret_key"
{
    workflow_runs: [{
        workflow_name: string,
        repo_name: string,
        event_actor: string,
        head_branch: string,
        workflow_run_id: string,
        status: string, // Success, Failure, Pending, Canceled
        duration: number,
        workflow_run_conducted_at: string (ISO 8601 format),
        html_url: string
    }]
}
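
For example, a CI step could report a run with a small Python script like this; the endpoint and header follow the proposal above, while the host, secret and field values are placeholders:

import datetime
import requests

payload = {
    "workflow_runs": [{
        "workflow_name": "deploy-prod",
        "repo_name": "middleware",
        "event_actor": "jayantbh",
        "head_branch": "master",
        "workflow_run_id": "run-12345",
        "status": "Success",
        "duration": 200,
        "workflow_run_conducted_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "html_url": "https://example.com/ci/runs/12345",  # placeholder CI run URL
    }]
}

response = requests.post(
    "https://<middleware-host>/public/webhook/deployments",  # placeholder host
    headers={"X-Secret-Key": "<secret_key>"},
    json=payload,
    timeout=10,
)
# Per the proposal, the endpoint always returns 200; errors, if any, are in the body.
print(response.json())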

Pre Processing Validation

  • Verify API Key
  • Verify size of data
  • Verify required fields
  • Verify if repo exists in Middleware

If there is an error, send 200 with the error message and notify the user about the erroneous data over email/Slack. (The notification module can be developed separately and integrated later.)
If validation succeeds, store the data in WebhookEventRequests, call the queue, and send a 200 response.

Store the data for processing

  1. Store the data in the Postgres table WebhookEventRequests (which acts as a data dump table).
WebhookEvents {
    request_type: "DEPLOYMENT", // Or INCIDENT
    request_data: any, // For storing dump data
    status: string, // Waiting, Running, Skipped, Success, Failure
    error: string,
    created_in_db_at: time, 
    processed_at: time, 
    retries: number
}
  2. Call Celery/RQ to process the data asynchronously (a rough sketch follows below). The broker will be Redis.
  3. If there is any error, the WebhookEvents row will be updated accordingly with the error.
  4. If there is no error, update WebhookEvents and store the data in RepoWorkflow and RepoWorkflowRuns.
  • Note: If the server goes down, some of the jobs may not be executed, so their status in the database will remain Waiting or Running. On server restart, we can mark these as Skipped.
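
As a rough illustration of steps 2–4 with RQ (one of the two options named above; the queue name and function bodies are placeholders):

from redis import Redis
from rq import Queue

# Redis acts as the broker; the queue name is illustrative.
webhook_queue = Queue("webhook_events", connection=Redis())


def process_webhook_event(webhook_event_id: str) -> None:
    # Placeholder worker job: load the WebhookEvents row, validate the dump,
    # upsert RepoWorkflow / RepoWorkflowRuns, then update status, error and retries.
    ...


def dispatch_webhook_event(webhook_event_id: str) -> None:
    # Called from the webhook handler right after the WebhookEvents row is written;
    # a separate `rq worker webhook_events` process picks the job up asynchronously.
    webhook_queue.enqueue(process_webhook_event, webhook_event_id)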

Prune the synced data

  • For received webhook data, the user can request deletion of the synced data.
  • The received webhook logs are visible on the "Webhooks" page. Clicking on a log, the user can check the received webhook data and will have the option to prune the synced data there.
  • If we go this way, I think we will also need to store the ids of the synced data.
  • We can set a number of days after which received webhook data will be flushed.

UI

There will be 2 pages, namely Webhooks and API Keys.
The Webhooks page will show logs for received deployments and incidents; the corresponding queue status can be seen here as well.

Webhook and API Keys page UI

Kamlesh72 avatar Oct 23 '24 16:10 Kamlesh72

@adnanhashmi09 are any changes required?

Kamlesh72 avatar Mar 11 '25 17:03 Kamlesh72

API KEYS

  • User will be able to Create, Read and Delete API Keys.

// APIKeys Table Schema in Postgres
API_KEYS {
    name: string,
    secret_key: string,
    scopes: string[], //  [ WORKFLOW, INCIDENT ]
    org_id: string
}

Can we not leverage the already existing Integrations table for this?



Pre Processing Validation

  • Verify API Key
  • Verify size of data
  • Verify required fields
  • Verify if repo exists in Middleware

If there is an error, send 200 with the error message and notify the user about the erroneous data over email/Slack. (The notification module can be developed separately and integrated later.) If validation succeeds, store the data in WebhookEventRequests, call the queue, and send a 200 response.

No need to send Slack notifications. We will just send a detailed error response with the error message and how to solve it. We have a system logger in the UI; we can use that same page to display those errors. But that means we would need to store those errored responses somewhere (the WebhookEvents table can be leveraged).

Store the data for processing

  1. Store the data in postgres table WebhookEventRequests (which act as DataDump table).

WebhookEvents {
    request_type: "DEPLOYMENT", // Or INCIDENT
    request_data: any, // For storing dump data
    status: string, // Waiting, Running, Skipped, Success, Failure
    error: string,
    created_in_db_at: time,
    processed_at: time,
    retries: number
}
  2. Call Celery/RQ to process the data asynchronously. The broker will be Redis.
  3. If there is any error, the WebhookEvents row will be updated accordingly with the error.
  4. If there is no error, update WebhookEvents and store the data in RepoWorkflow and RepoWorkflowRuns.

  • Note: If the server goes down, some of the jobs may not be executed. So either their status in database will be Waiting or Running. Then on server restart, we can mark these as Skipped.
  1. For how long do we store the data here?
  2. Which one will you pick, Celery or RQ? I think RQ would be simpler and more suitable for our needs. Also, do we really need to include Redis? I think for our scale and data throughput Postgres could function fairly well as a broker. What do you think? We store the events in Postgres, and they keep getting picked up by a processor (a rough polling sketch follows this list). The more components we add to the system, the more complex debugging becomes.
  3. Why mark them as Skipped? Will they not be processed? This can cause data inconsistency.
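
For point 2, a rough sketch of how Postgres-as-broker could work, using the proposed WebhookEvents columns and FOR UPDATE SKIP LOCKED so concurrent workers don't claim the same row (the DSN, table name and helper are illustrative):

import time
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///middleware")  # DSN is illustrative

CLAIM_ONE = text("""
    UPDATE webhook_events
       SET status = 'Running'
     WHERE id = (
           SELECT id FROM webhook_events
            WHERE status = 'Waiting'
            ORDER BY created_in_db_at
            LIMIT 1
              FOR UPDATE SKIP LOCKED)
    RETURNING id
""")


def poll_forever(process_event) -> None:
    # Claim one Waiting event at a time; `process_event` is expected to mark it
    # Success/Failure when done. Back off briefly when nothing is waiting.
    while True:
        with engine.begin() as conn:
            row = conn.execute(CLAIM_ONE).fetchone()
        if row is None:
            time.sleep(5)
            continue
        process_event(row.id)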

Prune the synced data

  • For received webhook data, the user can request deletion of the synced data.
  • The received webhook logs are visible on the "Webhooks" page. Clicking on a log, the user can check the received webhook data and will have the option to prune the synced data there.
  • If we go this way, I think we will also need to store the ids of the synced data.
  • We can set a number of days after which received webhook data will be flushed.
  1. I think a better place to view logs would be on the System Logs page.
  2. What do you mean by ids of synced data? Every entity we eventually save in the DB will have an id associated with it. Instead of deleting based on id, we can give the option of pruning data based on a time interval (see the sketch below).
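
For example, pruning by time interval could reduce to a bounded delete, roughly like the sketch below; the table and column names are hypothetical, and only webhook-sourced runs are touched:

from datetime import datetime
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///middleware")  # DSN is illustrative

# Hypothetical statement: assumes a source column marks webhook-synced runs.
PRUNE_WEBHOOK_RUNS = text("""
    DELETE FROM repo_workflow_runs
     WHERE org_id = :org_id
       AND source = 'WEBHOOK'
       AND conducted_at BETWEEN :start AND :end
""")


def prune_webhook_runs(org_id: str, start: datetime, end: datetime) -> int:
    with engine.begin() as conn:
        result = conn.execute(PRUNE_WEBHOOK_RUNS,
                              {"org_id": org_id, "start": start, "end": end})
        return result.rowcount  # number of runs pruned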

UI

There will be 2 pages, namely Webhooks and API Keys. The Webhooks page will show logs for received deployments and incidents; the corresponding queue status can be seen here as well.

Webhook and API Keys page UI

Let's keep the webhook integration and API key management on the "Integrations" page itself. We can have a Manage button on the Webhook integration's card where the user can manage their API keys and data pruning.

adnanhashmi09 avatar Apr 11 '25 09:04 adnanhashmi09

@Kamlesh72 A flow diagram and LLD will really help solidify your approach. Currently I am unable to see the lifecycle of a request nor do I see details regarding the async processing of data.

adnanhashmi09 avatar Apr 11 '25 09:04 adnanhashmi09

@adnanhashmi09 Agreed, using the Integrations table for storing API keys is better.

Using Postgres as a broker opens up a lot of options. I checked, and Postgres as a broker is not supported by Celery or RQ.
I think we can use this library: https://procrastinate.readthedocs.io/en/stable/discussions.html
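
A rough sketch of how that could look with Procrastinate; the connector setup, task name and argument are illustrative and should be checked against the library docs:

import procrastinate

# Connection setup is illustrative; see the Procrastinate docs for connector options.
app = procrastinate.App(
    connector=procrastinate.PsycopgConnector(conninfo="postgresql:///middleware"),
)


@app.task(name="process_webhook_event")
def process_webhook_event(webhook_id: str) -> None:
    # Placeholder: load the WebhookEvent row by id, validate, write
    # RepoWorkflow / RepoWorkflowRuns, and update status/error on failure.
    ...


def dispatch_webhook_event(webhook_id: str) -> None:
    # Defer a job into the procrastinate_jobs table; a separate
    # `procrastinate worker` process picks it up and runs the task.
    with app.open():
        process_webhook_event.defer(webhook_id=webhook_id)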


No need to send Slack notifications. We will just send a detailed error response with the error message and how to solve it. We have a system logger in the UI; we can use that same page to display those errors. But that means we would need to store those errored responses somewhere (the WebhookEvents table can be leveraged).

Got it, so request_data and the error (if any) can be stored in the WebhookEvents table.
Then WebhookEvent.id can be passed to the queue.


I think a better place to view logs would be on the System Logs page.

Here we can query procrastinate_jobs and, using procrastinate_job.args.webhook_id, get the corresponding WebhookEvent to show the payload and error. ( schema: https://github.com/procrastinate-org/procrastinate/blob/main/procrastinate/sql/schema.sql )
Still, there will be some WebhookEvents that errored during the "Pre Processing Validation" stage (as no job was enqueued).
These also need to be shown in System Logs, right?


Pruning data based on a time interval is very user-friendly. Thanks! 🙌🏻

I will share the LLD and dataflow diagram soon.

Kamlesh72 avatar Apr 12 '25 10:04 Kamlesh72

I think instead of full pre-validation, we can check only the API key and payload size.
If both are valid, then let the queue handle it.

All above details are added in this doc. https://whimsical.com/webhook-for-deployments-TnDQV6p8YMZ7ciaExNghJh

Kamlesh72 avatar Apr 12 '25 10:04 Kamlesh72