Revisit configuration management to support modification via UI
Most configuration options for the new services are currently set via `application.properties` or environment variables.
This is in contrast to DT 4.x, which used primarily the UI. The drawback to UI configuration is that a new instance cannot easily be deployed with the desired settings in place, but requires manual adjustments.
However, there is still a requirement to have configuration options tweakable via the UI. We have to find a way to allow for both models, as relying on only one of them apparently doesn't cut it.
Another challenge with dynamic configuration is that we need a mechanism to notify services about changes that are impacting them. For example, when an API key for Snyk is changed, the vulnerability analyzer needs to be notified.
If we have the services poll the database every time they need the config, we're going to DoS the database. If we add caching, we need proper cache invalidation (hard), or have to live with eventually consistent configs (bad).
Typically, systems rely on ZooKeeper or etcd for use cases like this, but we can't introduce more such heavy dependencies at this point.
It may be worth looking into PostgreSQL's LISTEN / NOTIFY support: https://www.postgresql.org/docs/current/sql-notify.html
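For reference, a minimal sketch of what consuming such notifications could look like with the pgjdbc driver (connection details and the `config_changed` channel name are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

public class ConfigNotificationListener {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dtrack", "dtrack", "dtrack")) {
            try (Statement stmt = conn.createStatement()) {
                // Another session would run: NOTIFY config_changed, 'vuln-source/github.token'
                stmt.execute("LISTEN config_changed");
            }
            PGConnection pgConn = conn.unwrap(PGConnection.class);
            while (true) {
                // Blocks for up to 1s waiting for notifications.
                PGNotification[] notifications = pgConn.getNotifications(1000);
                if (notifications == null) {
                    continue;
                }
                for (PGNotification notification : notifications) {
                    System.out.printf("channel=%s payload=%s%n",
                            notification.getName(), notification.getParameter());
                }
            }
        }
    }
}
```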
As per further discussion, Postgres' LISTEN / NOTIFY mechanism is not a good option, as it won't work across replicas. We specifically chose to make DB interactions of services read-only to support the use of read-only replicas.
Instead, the following option was proposed:
- Introduce a new Kafka topic that notifies about changes in the `CONFIGPROPERTY` table
- Records on that topic should not carry property values, but group names (e.g. `vuln-source`) and names (e.g. `github.token`)
- Services relying on configuration from the database subscribe to the topic
  - NB: Consumers can't use the same consumer group, we need all instances to get all records (see the sketch after this list)
    - Kafka Streams does something similar to support `GlobalKTable`s
- Upon retrieval of a change record, applications reload their configuration from the database
  - Consumers can further decide whether a reload is necessary based on the property group(s) and name(s) in the record
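To make the consumer-group caveat concrete, here's a rough sketch of what such a listener could look like. The topic name and record layout (group as key, property name as value) are assumptions, not settled:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConfigChangeListener {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // A unique group per instance ensures every instance sees every record,
        // similar to how Kafka Streams consumes GlobalKTable source topics.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "config-listener-" + UUID.randomUUID());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dtrack.config.changed")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Key: property group (e.g. "vuln-source"), value: property name
                    // (e.g. "github.token"). No property values are transported.
                    if (isRelevant(record.key(), record.value())) {
                        reloadConfigFromDatabase();
                    }
                }
            }
        }
    }

    private static boolean isRelevant(String group, String name) {
        // Placeholder: each service would match against the groups it cares about.
        return "vuln-source".equals(group);
    }

    private static void reloadConfigFromDatabase() {
        // Placeholder: re-query CONFIGPROPERTY and refresh the in-memory config.
    }
}
```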
It is possible that there is a slight delay between instance A and instance B of a service reloading their configuration. When users update configurations in the UI, the respective change will eventually make it to all affected services. This should be fine, but we need to make sure that it is clearly documented so users know what to expect.
Another complication with the approach above:
If we're assuming that database read replicas are being used, we'll run into race conditions:
- User updates configuration via UI
- API server processes the update, stores it in the database (leader node), and emits a Kafka record to inform other services
- Other services receive Kafka record, and reload their configuration from a DB follower node / read replica
- Config update has not replicated to read replica yet, so services reload stale data
We would not have this problem if we distribute the entire configuration through Kafka, but not sure whether that's a route we want to take...
Given there is no way to achieve a truly consistent view of configurations for all applications in the system without either:
- Performing excessive database queries (i.e. at least one per processed Kafka record), or
- Using a centralized solution that supports watches, such as ZooKeeper, etcd, or Consul
The (hopefully final) proposal is as follows:
- Services query the database whenever they need to read a configuration
- For some services, these queries will be very low volume (e.g. mirror-service); caching is not strictly required, but doesn't hurt either
- For services where we anticipate a high volume of queries (e.g. vulnerability-analyzer), we utilize local read-through caching with short expiration intervals (few seconds up to 1min)
This should allow services like the vulnerability-analyzer to continue processing multiple thousands of records per second, without hammering the database. In the worst case, the cached configuration will be stale for up to 1min, but given the other options we have, that is actually not that bad.
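As a sketch of what the read-through cache could look like, here's a minimal example using Caffeine; the class, the key format, and the database lookup are assumptions for illustration:

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class CachedConfigStore {

    // Read-through cache: misses (and expired entries) invoke the loader,
    // which fetches the property from the database. Entries are at most
    // ~30s stale, which bounds how long services can see outdated values.
    private final LoadingCache<String, String> cache = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofSeconds(30))
            .maximumSize(1_000)
            .build(this::fetchFromDatabase);

    public String getValue(String groupAndName) {
        return cache.get(groupAndName);
    }

    // Hypothetical lookup against the CONFIGPROPERTY table,
    // e.g. SELECT "PROPERTYVALUE" FROM "CONFIGPROPERTY" WHERE ...
    private String fetchFromDatabase(String groupAndName) {
        return null; // placeholder; null is treated as "not present" by Caffeine
    }
}
```

With `expireAfterWrite` there is no invalidation wiring to maintain; staleness is simply bounded by the TTL.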
Queries to fetch configuration options are lightweight. The biggest cost factor will be network latency, which the file-based configuration approach naturally doesn't have.
On the Quarkus side of things, I'm thinking we should plug into the existing config framework: https://quarkus.io/guides/config-extending-support#custom-config-source
By the looks of it, this can also be made to support reloads at runtime, and even change notification. Looking into it more.
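Following that guide, a bare-bones custom config source might look like the sketch below; the lookup logic is omitted, and registration would happen via a `META-INF/services/org.eclipse.microprofile.config.spi.ConfigSource` file:

```java
import java.util.Set;

import org.eclipse.microprofile.config.spi.ConfigSource;

public class DatabaseConfigSource implements ConfigSource {

    @Override
    public Set<String> getPropertyNames() {
        // Could enumerate names from the CONFIGPROPERTY table; an empty set
        // still allows lookups through getValue.
        return Set.of();
    }

    @Override
    public String getValue(String propertyName) {
        // Returning null tells the config framework to consult the next
        // source (e.g. application.properties). A real implementation would
        // delegate to the cached database lookup sketched earlier.
        return null;
    }

    @Override
    public String getName() {
        return "database";
    }

    @Override
    public int getOrdinal() {
        // Higher ordinal wins; application.properties defaults to 250 in
        // Quarkus, so anything above that lets database values take precedence.
        return 270;
    }
}
```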