[RFC] Handling webhooks in Backstage (Github)

Open kissmikijr opened this issue 3 years ago • 11 comments

Status: Open for comments

My intention was to design a general system that is extensible by the individual webhook handler implementations and distributes the incoming webhooks to the listeners. In this RFC I created a possible GitHub implementation.

Need

A system that enables contributors to create modules/extensions for receiving webhooks in Backstage in an extensible way, and that allows other plugins to act on these webhooks.

A way to receive webhooks → to keep the catalog up to date via a push model.

The catalog currently uses a pull model to keep data consistent with the different SCM integrations through the autodiscovery modules. They create locations in the catalog, and the processors ingest all the entities based on these locations.

Drawbacks of this implementation are resource usage, time, possible rate limit issues, and the lack of real-time updates when the catalog files change. The interval at which the refresh loop runs is configurable via the catalog.ts file. If we make the interval small for a more real-time feel, we can potentially exceed the rate limits (GitHub); if we make it run less frequently, we won't get immediate feedback when the catalog files change in git.

With webhook support, we would get the benefit of real-time updates in the catalog without exceeding the rate limit. Potentially multiple plugins could subscribe to these events and act on them.

Proposal

To achieve this we are going to need the following extensions.

  • webhooks-backend (new)
  • webhooks-backend-module-github (new)
  • events-backend-module (new)
  • catalog-backend (existing)

webhooks-backend (new)

The webhooks-backend plugin will be the entry point for the backstage instance to receive any kind of webhooks. It is going to be a regular backend plugin wired into the packages/backend as any other [backend plugin](https://backstage.io/docs/plugins/backend-plugin#developing-your-backend-plugin).

It exposes the route /webhooks, and any additional module routes are attached under it. It does not handle any webhooks itself and should not be called directly. Its purpose is to collect all the other webhooks-backend-module installations and pass over the dependencies. Adding a specific webhook installation to this plugin should be done like this:

import { errorHandler } from '@backstage/backend-common';
import express from 'express';
import Router from 'express-promise-router';
import githubWebhooks from '@backstage/webhooks-backend-module-github';

export async function createRouter({ eventsClient }): Promise<express.Router> {
  const router = Router();
  // Each webhook module mounts its own sub-route under /webhooks.
  router.use('/github', await githubWebhooks({ eventsClient }));
  router.use(errorHandler());
  return router;
}

webhooks-backend-module-github (new)

This will be the implementation for receiving the GitHub webhooks. It exposes the route POST /github and returns a router.

Dependencies:

  • @backstage/events-backend-module

The GithubWebhookService is responsible for filtering the events: it acts only when a push event is received in which one of the commits touches a Backstage-specific yaml file, and then publishes a message via the EventsClient to the github-webhooks topic.

One repository could contain any number of Backstage catalog files. They could be set up with a root catalog-info.yaml, a Location file that lists any number of other Backstage catalog files of different kinds. In this case, we'd like to refresh all the individual entity files even when only one of the sub catalog files is modified. The GitHub webhook API (link) sends the modified, added, and removed file names on pushes in the aforementioned three arrays. I can see a couple of variations on how this could be handled.

  • option 1

    Create a config.d.ts addition under the backend key where users could configure which paths the plugin should expect Backstage catalog files at. When a file path matches the configuration, the module would construct a LocationSpec of type url based on the path and repo name and publish the corresponding message to the event system. On receiving the event, the new refreshByLocation function would trigger a refresh for the entity.

    This option should be easier to implement but requires more configuration from the users.
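    Option 1 could be sketched roughly as follows. All names here (the config shape, the helper functions, and the blob-URL layout) are assumptions for illustration, not an existing Backstage API:

```typescript
// Sketch of option 1 (all names are assumptions, not an existing Backstage API).
// The user configures which repo paths may contain catalog files; on a matching
// push event the module builds a LocationSpec-like object of type 'url'.
interface WebhooksGithubConfig {
  catalogFilePaths: string[]; // e.g. ['catalog-info.yaml', 'backstage/*.yaml']
}

interface LocationSpec {
  type: 'url';
  target: string;
}

// Convert a simple glob (supporting '*' within one path segment) into a RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '[^/]*');
  return new RegExp(`^${escaped}$`);
}

function toLocationSpec(
  config: WebhooksGithubConfig,
  repoUrl: string,
  branch: string,
  filePath: string,
): LocationSpec | undefined {
  const matches = config.catalogFilePaths.some(g => globToRegExp(g).test(filePath));
  if (!matches) return undefined;
  return { type: 'url', target: `${repoUrl}/blob/${branch}/${filePath}` };
}
```

    A push payload's file arrays could then be mapped through toLocationSpec, and each match published to the event system.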

  • option 2

    Without additional configuration. The module could act on any change to a *.yaml file. First, it should check with a regex whether it is a Backstage yaml file; a Backstage yaml file should include a line like apiVersion: backstage.io/... or something similar. If it is a Backstage file, it should publish the messages to the event system to trigger a create, update, or delete action for the file. Even though we read the content of the file, I suggest we still communicate using the Location spec of the entity file, because one catalog yaml can contain one or more Backstage yaml files.

    This solution has the pro that it does not require additional configuration; however, the implementation would be more complex and tied to querying the GitHub API.

I suggest option 2.
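The content check at the heart of option 2 could be as small as this. The function name and the exact regex are assumptions; the only requirement from the text above is matching an apiVersion line that references backstage.io:

```typescript
// Sketch of option 2's content check (function name and regex are assumptions).
// A file is treated as a Backstage catalog file if it declares a Backstage apiVersion.
const BACKSTAGE_API_VERSION = /^apiVersion:\s*['"]?backstage\.io\//m;

function isBackstageCatalogFile(yamlContent: string): boolean {
  return BACKSTAGE_API_VERSION.test(yamlContent);
}
```

The module would fetch each changed *.yaml file via the GitHub API and run this check before publishing to the event system.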

Messages based on the received webhook:

  • added - the catalog-info.yaml file was added to a repository where the webhook was already configured
  • modified - the catalog-info.yaml file was modified; we need to update the existing entity
  • removed - the catalog-info.yaml file was removed from the repository; we need to delete the entities

// @backstage/catalog-backend
export interface LocationInput {
  type: string;
  target: string;
}

{
  type: "added" | "modified" | "removed";
  payload: LocationInput;
}
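The three message types above map directly onto the file arrays GitHub sends per commit in a push payload. A self-contained sketch of that mapping (the type names follow this RFC; the helper itself is an assumption):

```typescript
// Sketch mapping a GitHub push payload's per-commit file arrays to the
// messages above (type names follow the RFC; the helper is an assumption).
interface LocationInput {
  type: string;
  target: string;
}

interface CatalogFileEvent {
  type: 'added' | 'modified' | 'removed';
  payload: LocationInput;
}

interface PushCommit {
  added: string[];
  modified: string[];
  removed: string[];
}

function toCatalogFileEvents(
  repoUrl: string,
  branch: string,
  commits: PushCommit[],
): CatalogFileEvent[] {
  const events: CatalogFileEvent[] = [];
  for (const commit of commits) {
    for (const type of ['added', 'modified', 'removed'] as const) {
      for (const path of commit[type]) {
        if (!path.endsWith('.yaml')) continue; // only candidate catalog files
        events.push({
          type,
          payload: { type: 'url', target: `${repoUrl}/blob/${branch}/${path}` },
        });
      }
    }
  }
  return events;
}
```

Each resulting event would then be published to the github-webhooks topic.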

events-backend-module (new)

A new plugin to handle distributing the events that are going to be used in Backstage.

I propose this be built as a pub/sub pattern with topics. It would mean the publishers would be responsible for creating certain topics and publishing their messages to those topics. The subscribers would be able to subscribe to any kind of topic they’d like.

In my opinion, this is good because each webhook implementation could publish its events to its own topics, and plugins could subscribe to any number of topics they are interested in acting on.

This plugin should be configurable via the app-config.yaml to tell it which backend implementation to use.

A client config to tell the plugin which client should be used:

// config.d.ts
export interface Config {
  eventSystem: {
    client: 'memory' | 'rabbitmq' | 'kafka' | 'db';
  };
}

./plugins
  ./events-backend-module
    ./clients
      DefaultEventsClient.ts
      DatabaseEventsClient.ts
      ...

export interface EventsClient {
  subscribe(options: { topic: string; callback: (payload: object) => void }): void;
  publish(options: { topic: string; payload: object }): void;
}

A dummy implementation using the built-in Node.js EventEmitter could look like the following, eventually moving to a custom-built pub/sub backed by postgresql/sqlite:

import { EventEmitter } from 'events';

export class DefaultEventsClient implements EventsClient {
  private readonly eventEmitter: EventEmitter;

  constructor() {
    this.eventEmitter = new EventEmitter();
  }

  subscribe(options) {
    this.eventEmitter.on(options.topic, options.callback);
  }

  publish(options) {
    this.eventEmitter.emit(options.topic, options.payload);
  }
}

This plugin would be responsible for configuring the public EventsClient class and exposing the configured client.

This would make it possible to implement a client for RabbitMQ/Kafka; as long as they implement the EventsClient interface, they should work.
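To show the interface in action, here is a self-contained round-trip through an in-memory client implementing the EventsClient shape from this RFC. The class and topic name are illustrative only; nothing here is a published Backstage API:

```typescript
import { EventEmitter } from 'events';

// In-memory client implementing the EventsClient shape sketched in this RFC.
interface EventsClient {
  subscribe(options: { topic: string; callback: (payload: object) => void }): void;
  publish(options: { topic: string; payload: object }): void;
}

class InMemoryEventsClient implements EventsClient {
  private readonly emitter = new EventEmitter();

  subscribe(options: { topic: string; callback: (payload: object) => void }): void {
    this.emitter.on(options.topic, options.callback);
  }

  publish(options: { topic: string; payload: object }): void {
    this.emitter.emit(options.topic, options.payload);
  }
}

// Round-trip: a subscriber on the 'github-webhooks' topic receives the payload
// synchronously, because EventEmitter delivers events in the same tick.
const client = new InMemoryEventsClient();
const received: object[] = [];
client.subscribe({ topic: 'github-webhooks', callback: p => received.push(p) });
client.publish({ topic: 'github-webhooks', payload: { type: 'modified' } });
```

A RabbitMQ or Kafka client would keep the same two-method surface and only swap out the transport.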

events-client backed by db

A possible naive DatabaseEventsClient implementation to start a discussion.

There would be an events table. It is a requirement that multiple clients can subscribe to the same event stream; for this we'll need to keep track of the clients in an events_clients table.

The publish function would insert a new row into the events table on every invocation.

async publish(options) {
  await this.database('events').insert({ topic: options.topic, payload: options.payload });
}

The subscribe function polls the events table to check for new events: it queries the client's offset from events_clients → queries the events → handles the events → updates the offset.

subscribe(options) {
  setInterval(async () => {
    await this.database.transaction(async tx => {
      const clientData = await tx('events_clients').select('*').where('name', '=', this.clientName);
      const events = await tx('events')
        .select('*')
        .where({ topic: options.topic })
        .andWhere('id', '>', clientData[0].offset)
        .orderBy('id', 'asc');
      // handle events
      for (const event of events) {
        options.callback(event.payload);
      }
      if (events.length > 0) {
        await tx('events_clients')
          .where({ name: this.clientName })
          .update({ offset: events[events.length - 1].id });
      }
    });
  }, pollingInterval);
}

In the current state, every plugin/package that wants to subscribe to the events should instantiate their own EventsClient with their own unique id, potentially the module name. This can be an issue when we’d like to split and scale the individual modules.

I think this part might be worth its own RFC. For this RFC I used nodejs built-in event system.

catalog-backend (extension)

In the current implementation of the catalog backend, there is no way to trigger updates based on a location changing. The refresh of the entities happens in a loop at a set interval. There is an implementation in the DefaultProcessingDatabase for refreshing individual entities.

By listening to webhooks from different providers, we will need to fire a refresh for all the entities that belong to a certain location when we get a modification event from the provider. (I researched how GitHub webhooks work only. In the case of GitHub, it lists the modified files in a modified files array in the webhook body.)

I'd like to introduce an extension to the database to be able to refresh all the entities that belong to a certain location. This information is not present in the database as an individual field. I found that for triggering an update we could use the backstage.io/managed-by-location annotation on the entities.

I propose the following new function:

// types.ts
export interface ProcessingDatabase {
  // ...
  refreshByLocation(
    txOpaque: Transaction,
    options: { location: string },
  ): Promise<void>;
  // ...
}

A possible query for this, without starting to store the managed-by-location annotation in a separate column, is to query into the final_entity column.

SELECT final_entity FROM final_entities
WHERE final_entity::json -> 'metadata' -> 'annotations' ->> 'backstage.io/managed-by-location' = 'url:https://github.com/RoadieHQ/roadie/tree/main/catalog-info.yaml';

This query returns all the entities that have the managed-by-location annotation that matches the location string. When we have all the entities we could schedule a refresh on the individual entities.
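The match the SQL above performs can also be expressed as a pure function over already-parsed entities, which is useful for testing the behaviour without a database. The entity shape is simplified and the helper name is an assumption:

```typescript
// The annotation match the SQL query performs, sketched over parsed entities
// (simplified entity shape; the helper name is an assumption).
interface Entity {
  metadata: {
    name: string;
    annotations?: Record<string, string>;
  };
}

function entitiesManagedByLocation(entities: Entity[], location: string): Entity[] {
  return entities.filter(
    e => e.metadata.annotations?.['backstage.io/managed-by-location'] === location,
  );
}
```

Each entity returned by the filter would then get an individual refresh scheduled.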

catalog-backend-module-github (extension)

I propose to extend the catalog-backend-module-github with an EntityProvider → GithubWebhooksEntityProvider

This provider would be responsible for subscribing to the events on the github topic. For subscribing, an EventsClient should be injected into it.

It would handle the incoming events as follows:

On event type: added

It would trigger an applyMutation to create a LocationEntity for the provided target and type.

this.connection.applyMutation({
  type: 'full',
  entities: [
    {
      locationKey: this.getProviderName(),
      entity: locationSpecToLocationEntity({
        type,
        target,
        presence: 'optional',
      }),
    },
  ],
});

On event type: modified:

For handling modified events, the EntityProviderConnection interface needs to be extended with a new function: refreshByLocation. This would be a proxy for the processing database's refreshByLocation function.

await this.connection.refreshByLocation(options); // options: LocationInput
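A minimal fake can demonstrate the proposed call path from the provider's modified-event handler down to the connection. The interface extension is exactly what this RFC proposes; nothing here is a published Backstage API:

```typescript
// Minimal fake demonstrating the proposed refreshByLocation proxying
// (the interface extension is this RFC's proposal, not a published API).
interface LocationInput {
  type: string;
  target: string;
}

interface EntityProviderConnection {
  refreshByLocation(options: LocationInput): Promise<void>;
}

class RecordingConnection implements EntityProviderConnection {
  readonly refreshed: LocationInput[] = [];
  async refreshByLocation(options: LocationInput): Promise<void> {
    this.refreshed.push(options);
  }
}

// The provider's handler for a 'modified' event simply forwards the location;
// the real connection would proxy it to the processing database.
async function onModified(connection: EntityProviderConnection, location: LocationInput) {
  await connection.refreshByLocation(location);
}
```

In the real implementation, the connection would forward the call to the processing database's refreshByLocation from the catalog-backend section above.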

On event type: removed:

It would trigger an applyMutation to delete the LocationEntity for the provided target and type.

this.connection.applyMutation({
  type: 'delta',
  added: [],
  removed: [
    {
      locationKey: this.getProviderName(),
      entity: locationSpecToLocationEntity({
        type,
        target,
        presence: 'optional',
      }),
    },
  ],
});

Alternatives

I am not sure there is an alternative to the general concept. We currently use a pull model; this RFC wants to introduce a push model. In the implementation details, there could be vastly different approaches.

Risks

A risk I can see immediately is in the consumers of these events: consumers cannot rely on webhooks being delivered 100% of the time. If the catalog relied only on these webhooks to keep its state up to date, it could get out of sync quickly on webhook delivery failures. A potential mitigation is to keep the discovery processors and increase the refresh interval to a large value, since they are only needed as a backup.

The webhook URLs would have to be accessible from the internet, since the SCM services need to be able to post to the Backstage instance. There should be some authorization handling in the webhook implementations. It results in a bigger attack surface.
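For GitHub specifically, the standard authorization mechanism is the X-Hub-Signature-256 header: GitHub sends an HMAC-SHA-256 of the raw request body keyed with the configured webhook secret. A verification sketch using Node's crypto module (the function name is an assumption):

```typescript
import { createHmac, timingSafeEqual } from 'crypto';

// GitHub signs each delivery with an HMAC-SHA-256 of the raw body using the
// configured webhook secret, sent as 'X-Hub-Signature-256: sha256=<hex>'.
function verifyGithubSignature(secret: string, rawBody: string, signatureHeader: string): boolean {
  const expected = 'sha256=' + createHmac('sha256', secret).update(rawBody).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // Constant-time comparison to avoid leaking the signature via timing.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The webhooks-backend-module-github router would run this check against the raw body before handing the payload to the GithubWebhookService.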

kissmikijr avatar Apr 26 '22 10:04 kissmikijr

Awesome!

  1. GitHub nowadays has support for polling payloads that you missed, via its Webhook Delivery API. If the events-backend regularly polled that, it could repair its state, and it'd be safe.

Also, regarding GitHub webhooks:

Note: Payloads are capped at 25 MB. If your event generates a larger payload, a webhook will not be fired.

  2. So the events-backend should probably poll the repo HEAD~n (up to a known n) to check whether changes have been made that haven't been sent as payloads (and hence wouldn't end up as failed deliveries either), and internally "push" them to itself to mimic a webhook delivery.

I don't think, given the 2 above mechanisms, you'd need to poll individual yaml files 🤞

grantila avatar Apr 26 '22 13:04 grantila

Really nice suggestion, thank you! It seems we wouldn't need the refresh loop for GitHub, and it means we could configure the catalog backend to use only the GithubWebhookEntityProvider and wouldn't need to configure the discovery processors for it.

I think the polling and state handling would be better placed inside webhooks-backend-module-github instead of the general events-backend.

I still need to think through it end to end, but as I can see this could be a natural next extension to this, because you could just configure the processors and entity provider at the same time to not get desynced!

kissmikijr avatar May 03 '22 10:05 kissmikijr

This is a solid idea and you've clearly put a lot of thought into it! Might make sense to walk this through another SCM provider than GitHub to see if all our assumptions work? Azure DevOps Service and Server both support Web Hooks via what is called Service Hooks, here's the list of events and their payloads: https://docs.microsoft.com/en-us/azure/devops/service-hooks/events?view=azure-devops

awanlin avatar May 03 '22 12:05 awanlin

> This is a solid idea and you've clearly put a lot of thought into it! Might make sense to walk this through another SCM provider than GitHub to see if all our assumptions work? Azure DevOps Service and Server both support Web Hooks via what is called Service Hooks, here's the list of events and their payloads: https://docs.microsoft.com/en-us/azure/devops/service-hooks/events?view=azure-devops

I think the individual entity provider implementations are not really dependent on how this is going to be wired into the backend, connected together, and how events are received. I think we should potentially create a separate RFC for each supported provider's EntityProvider that will consume the events from that third party.

kissmikijr avatar Jun 02 '22 15:06 kissmikijr

Thanks for this writeup!

I think it'll be interesting to see how this interacts with discovery (as opposed to refreshes). There is probably a strong case to be made to let adopters easily set up discovery, one way or another, backed by this event system, but having it be opt-in so it doesn't apply for a full org whether you want to or not. Maybe it'll be in terms of defining some root backstage config file that works like a dotfile in your repo (and which is also auto-discovered and handled by the backend!). Maybe it'll be in terms of "if you install this backstage github app, you opt into auto discovery". Maybe some other mechanism.

> In this case, we'd like to refresh all the individual entity files when there is modification only in one of the sub catalog files

I am probably being a bit dense late in the day - this sentence made me wonder. Do we really want that? My spontaneous reaction is that it's enough to refresh only entities related to the actual modified file. Well, those may have relations to other entities of course, but those will be re-stitched automatically too as the originator of the relations changes.

freben avatar Jun 08 '22 14:06 freben

I opened https://github.com/backstage/backstage/issues/12155 to be able to handle refreshes on individual entities based on some values. In the case of this RFC, we'd store the filenames, or the whole GitHub URL, for every entity. So when a push payload comes in, we should be able to trigger a refresh for all of the entities that come from that file.

kissmikijr avatar Jun 20 '22 20:06 kissmikijr

I think I'll rework this one a bit, with the merge of the refreshKeys we are getting closer. I am tempted to actually split this into an event system RFC, and another one for how would a GithubEntityProvider look like, since they are not really tied together in implementation anyways.

kissmikijr avatar Jul 07 '22 16:07 kissmikijr

I am considering using a queue (AWS SQS) as a buffer, and also to keep Backstage within the internal network only, for a webhook integration for Bitbucket Cloud. A first PoC works fine.

pjungermann avatar Sep 19 '22 19:09 pjungermann

> The GithubWebhookService responsible for the filtering of the events to act only when a push event is received where one of the commits contains a backstage specific yaml file, then publish a message via the EventsClient to the github-webhooks topic.

The filtering will be tricky with Bitbucket Cloud's repo:push webhook. It does not contain enough information to allow such filtering, and you would need to make API calls to do so, which would harm the overall goal of reducing API calls in light of rate limits.

Of course, this is very SCM provider specific as the events differ for them.

pjungermann avatar Sep 20 '22 14:09 pjungermann

> The Github webhook API (link) on pushes sends the modified,added,removed file names in the aforementioned three arrays.

This information is not available in Bitbucket Cloud repo:push events: https://support.atlassian.com/bitbucket-cloud/docs/event-payloads/#Push

The logic for this would need to be put somewhere close to the provider.

Basically, the part described at

> webhooks-backend-module-github (new)

including LocationInput, etc., would not work there.

pjungermann avatar Sep 20 '22 15:09 pjungermann

@kissmikijr re webhook-backend

> Its purpose is to collect all the other webhooks-backend-module installations and pass over the dependencies. Adding a specific webhook installation to this plugin should be done like this:

Would you consider that all modules need to be registered there? Did I understand it right?

I would maybe see the webhook-backend as a dependency of the modules, which then could register themselves to it. 🤔 And as a user, you decide which modules you need.

But maybe I misunderstood.

pjungermann avatar Sep 21 '22 09:09 pjungermann

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 20 '22 09:11 github-actions[bot]

@pjungermann Hows the event bus stuff going? Is there a tracking ticket/RFC for that?

webark avatar Nov 20 '22 20:11 webark

@webark there's a lot of activity, first iterations and integrations are in - check in the events-* plugins https://github.com/backstage/backstage/tree/master/plugins

freben avatar Nov 21 '22 09:11 freben

@webark basis and support for webhooks at the BitbucketCloudEntityProvider was merged (PR #13931) and is part of the 1.8.0 release.

There are a few follow-up PRs (PR #14691, PR #14688, PR #14689).

There is no activity yet regarding adding webhook support to other entity providers like the GithubEntityProvider. It shouldn't be very complicated though.

In case you can contribute, feel free :-)

pjungermann avatar Nov 21 '22 11:11 pjungermann

Amazing work @pjungermann

I want to help support the GithubEntityProvider. Do I just need to implement the EventSubscriber interface? Is there any doc I can check to get more information?

angeliski avatar Nov 21 '22 14:11 angeliski

@angeliski thanks!

There are only the README.md files inside all modules. Would be great if you give it a try using these for now. Documentation as part of backstage.io makes sense in the long run.

And yes, you would need to implement EventSubscriber. And you should check out events-backend-module-github which contains an EventRouter implementation for GitHub. Additionally, you might want to check out PR #14689 -- even though it is not required for the implementation.

You can check the implementation at BitbucketCloudEntityProvider, too.

And feel free to reach out at Discord and/or open issues or PRs for unclear parts or missing features.

pjungermann avatar Nov 21 '22 22:11 pjungermann

Nice! I am looking into how to do it and have some doubts; I will open a draft so we can talk about those points.

angeliski avatar Nov 21 '22 22:11 angeliski

I started to work on https://github.com/backstage/backstage/pull/14758 as a draft. It's working, but I want to get a better overview before submitting it for review (it is a little messy now).

angeliski avatar Nov 22 '22 01:11 angeliski

I started the draft for handling org events: https://github.com/backstage/backstage/pull/14870

angeliski avatar Nov 25 '22 17:11 angeliski

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jan 24 '23 17:01 github-actions[bot]