Revisit configuration management to support modification via UI
Most configuration options for the new services are currently set via `application.properties` or environment variables.
This is in contrast to DT 4.x, which used primarily the UI. The drawback to UI configuration is that a new instance cannot easily be deployed with the desired settings in place, but requires manual adjustments.
However, there is still a requirement to have configuration options tweakable via the UI. We have to find a way to allow for both models, as relying on only one of them apparently doesn't cut it.
Another challenge with dynamic configuration is that we need a mechanism to notify services about changes that are impacting them. For example, when an API key for Snyk is changed, the vulnerability analyzer needs to be notified.
If we have the services poll the database every time they need the config, we're going to DoS the database. If we add caching, we need proper cache invalidation (hard), or have to live with eventually consistent configs (bad).
Typically, systems rely on ZooKeeper or etcd for use cases like this, but we can't introduce more such heavy dependencies at this point.
It may be worth looking into PostgreSQL's LISTEN / NOTIFY support: https://www.postgresql.org/docs/current/sql-notify.html
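For reference, a minimal sketch of what consuming such notifications could look like with the pgjdbc driver (connection details and the `config_changed` channel name are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

public class ConfigNotificationListener {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dtrack", "dtrack", "dtrack")) {
            try (Statement stmt = conn.createStatement()) {
                // Another session would run: NOTIFY config_changed, 'vuln-source/github.token'
                stmt.execute("LISTEN config_changed");
            }
            PGConnection pgConn = conn.unwrap(PGConnection.class);
            while (true) {
                // Blocks for up to 1s waiting for notifications.
                PGNotification[] notifications = pgConn.getNotifications(1000);
                if (notifications == null) {
                    continue;
                }
                for (PGNotification notification : notifications) {
                    System.out.printf("channel=%s payload=%s%n",
                            notification.getName(), notification.getParameter());
                }
            }
        }
    }
}
```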
As per further discussion, Postgres' LISTEN / NOTIFY mechanism is not a good option, as it won't work across replicas. We specifically chose to make DB interactions of services read-only to support the use of read-only replicas.
Instead, the following option was proposed:
- Introduce a new Kafka topic that notifies about changes in the `CONFIGPROPERTY` table
- Records on that topic should not carry property values, but group names (e.g. `vuln-source`) and names (e.g. `github.token`)
- Services relying on configuration from the database subscribe to the topic
  - NB: Consumers can't use the same consumer group, we need all instances to get all records (see the sketch after this list)
    - Kafka Streams does something similar to support `GlobalKTable`s
- Upon retrieval of a change record, applications reload their configuration from the database
  - Consumers can further decide whether a reload is necessary based on the property group(s) and name(s) in the record
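To make the consumer-group caveat concrete, here's a rough sketch of what such a listener could look like. The topic name and record layout (group as key, property name as value) are assumptions, not settled:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConfigChangeListener {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // A unique group per instance ensures every instance sees every record,
        // similar to how Kafka Streams consumes GlobalKTable source topics.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "config-listener-" + UUID.randomUUID());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dtrack.config.changed")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Key: property group (e.g. "vuln-source"), value: property name
                    // (e.g. "github.token"). No property values are transported.
                    if (isRelevant(record.key(), record.value())) {
                        reloadConfigFromDatabase();
                    }
                }
            }
        }
    }

    private static boolean isRelevant(String group, String name) {
        // Placeholder: each service would match against the groups it cares about.
        return "vuln-source".equals(group);
    }

    private static void reloadConfigFromDatabase() {
        // Placeholder: re-query CONFIGPROPERTY and refresh the in-memory config.
    }
}
```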
It is possible that there is a slight delay between instance A and instance B of a service reloading their configuration. When users update configurations in the UI, the respective change will eventually make it to all affected services. This should be fine, but we need to make sure that it is clearly documented so users know what to expect.
Another complication with the approach above:
If we're assuming that database read replicas are being used, we'll run into race conditions:
- User updates configuration via UI
- API server processes the update, stores it in the database (leader node), and emits a Kafka record to inform other services
- Other services receive Kafka record, and reload their configuration from a DB follower node / read replica
- Config update has not replicated to read replica yet, so services reload stale data
We would not have this problem if we distribute the entire configuration through Kafka, but not sure whether that's a route we want to take...
Given there is no way to achieve a truly consistent view of configurations for all applications in the system without either:
- Performing excessive database queries (i.e. at least one per processed Kafka record), or
- Using a centralized solution that supports watches, such as ZooKeeper, etcd, or Consul
The (hopefully final) proposal is as follows:
- Services query the database whenever they need to read a configuration
- For some services, these queries will be very low volume (e.g. mirror-service); caching is not strictly required, but doesn't hurt either
- For services where we anticipate a high volume of queries (e.g. vulnerability-analyzer), we utilize local read-through caching with short expiration intervals (few seconds up to 1min)
This should allow services like the vulnerability-analyzer to continue processing multiple thousands of records per second, without hammering the database. In the worst case, the cached configuration will be stale for up to 1min, but given the other options we have, that is actually not that bad.
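As a sketch of what the read-through cache could look like, here's a minimal example using Caffeine; the class, the key format, and the database lookup are assumptions for illustration:

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class CachedConfigStore {

    // Read-through cache: misses (and expired entries) invoke the loader,
    // which fetches the property from the database. Entries are at most
    // ~30s stale, which bounds how long services can see outdated values.
    private final LoadingCache<String, String> cache = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofSeconds(30))
            .maximumSize(1_000)
            .build(this::fetchFromDatabase);

    public String getValue(String groupAndName) {
        return cache.get(groupAndName);
    }

    // Hypothetical lookup against the CONFIGPROPERTY table,
    // e.g. SELECT "PROPERTYVALUE" FROM "CONFIGPROPERTY" WHERE ...
    private String fetchFromDatabase(String groupAndName) {
        return null; // placeholder; null is treated as "not present" by Caffeine
    }
}
```

With `expireAfterWrite` there is no invalidation wiring to maintain; staleness is simply bounded by the TTL.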
Queries to fetch configuration options are lightweight. The biggest cost factor will be network latency, which the file-based configuration approach naturally doesn't have.
On the Quarkus side of things, I'm thinking we should plug into the existing config framework: https://quarkus.io/guides/config-extending-support#custom-config-source
By the looks of it, this can also be made to support reloads at runtime, and even change notification. Looking into it more.
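Following that guide, a bare-bones custom config source might look like the sketch below; the lookup logic is omitted, and registration would happen via a `META-INF/services/org.eclipse.microprofile.config.spi.ConfigSource` file:

```java
import java.util.Set;

import org.eclipse.microprofile.config.spi.ConfigSource;

public class DatabaseConfigSource implements ConfigSource {

    @Override
    public Set<String> getPropertyNames() {
        // Could enumerate names from the CONFIGPROPERTY table; an empty set
        // still allows lookups through getValue.
        return Set.of();
    }

    @Override
    public String getValue(String propertyName) {
        // Returning null tells the config framework to consult the next
        // source (e.g. application.properties). A real implementation would
        // delegate to the cached database lookup sketched earlier.
        return null;
    }

    @Override
    public String getName() {
        return "database";
    }

    @Override
    public int getOrdinal() {
        // Higher ordinal wins; application.properties defaults to 250 in
        // Quarkus, so anything above that lets database values take precedence.
        return 270;
    }
}
```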