kafka-gitops Environment-specific state?

I'm using kafka-gitops for managing access across multiple clusters in different environments, and finding that it would be nice to have a way to deploy some state to certain environments and not others. For example, my testing/user-dev cluster might have topics for in-development projects or extra topics for testing different application configurations that I would never want to reach prod, but I still want to manage via desired-state configuration.

Some ideas:

1. Use a tool like jq/yq to merge the environment-specific state file with the global one before passing it to kafka-gitops. This leaves open the question of how exactly to merge/deal with conflicts.
2. Change kafka-gitops to accept multiple state files and merge them itself before validating/planning/applying. This moves the problem of conflicts to this project; I would totally understand if you didn't want to take on that complexity at this time.
3. Use two separate state files and execute apply twice, using prefixes on the environment-specific state and settings.topics.blacklist.prefixed to prevent the two from clashing.
4. Have kafka-gitops add an option/setting to prefix all of its changes with some string, accomplishing basically the same as 3, but built-in. (An analogy might be docker stack deployments prefixing all containers/networks/secrets/etc with the name of the stack.)
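To make the conflict question in ideas 1 and 2 concrete, here is a minimal sketch of one possible merge rule: combine the two files section by section and refuse to continue when the same entry name appears in both. It assumes the state files have already been parsed into plain dictionaries; `merge_states` and the topic names are hypothetical and not part of kafka-gitops.

```python
from typing import Any, Dict

State = Dict[str, Dict[str, Any]]


def merge_states(base: State, overlay: State) -> State:
    """Combine two parsed state files section by section.

    An entry that appears in both files under the same section (for example
    the same topic name) is reported as a conflict rather than merged.
    """
    # Copy each section so the inputs are left untouched.
    merged: State = {section: dict(entries) for section, entries in base.items()}
    for section, entries in overlay.items():
        target = merged.setdefault(section, {})
        clashes = sorted(set(target) & set(entries))
        if clashes:
            raise ValueError(f"conflicting {section} entries: {clashes}")
        target.update(entries)
    return merged


# Hypothetical global and environment-specific state, already parsed from YAML.
global_state = {"topics": {"orders": {"partitions": 6, "replication": 3}}}
dev_state = {"topics": {"orders-experiment": {"partitions": 1, "replication": 1}}}

merged = merge_states(global_state, dev_state)
print(sorted(merged["topics"]))
```

Treating every name clash as an error is only one answer to the conflict question; an override or deep-merge rule would replace the `raise` with different logic.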

Some dissenting thoughts:

5. Don't. Maintain separate state files for each environment. This is not my preference because then the main state file is not being tested in the lower-level environments, and it just doesn't feel very devops-y 😛
6. Use a single state file, but maintain environment differences using separate git branches. I'm afraid that this would devolve into a mess of git branches, custom merges, cherry-picks etc. But maybe this would work better than I'm imagining.
7. Don't use multiple clusters / don't have inconsistent configuration between environments / this is a terrible idea. Fair enough 😄, but why?

What do you think?

Aug 07 '20 18:08 infogulch

Hi, thanks for the lengthy description! This is a totally valid issue/enhancement request.

Currently, my organization and projects follow number 5. We have a separate state file for each environment. I agree, it is not thoroughly tested through lower environments -- but this is semi-common with Terraform and similar tools (at least in my experience).

My thoughts:

1. Good thought, but I don't think I would want to make it that complicated for the end user.
2. Agreed. This reminds me of helm charts, where you have a base chart and value overrides. Not sure I'd want to add something that complex at this time, though I like this option best.
3. I could see this one getting pretty messy, pretty quickly.
4. I like this, but I wouldn't want to have to prefix topic names and such. Topic names should be the same for every environment unless you're doing the logical partitioning of a cluster (e.g. your #7 :P), in my opinion.
5. This is the current solution that works and isn't awful.
6. Separate git branches would be a nightmare, especially if you have multiple people/teams making PRs to add in topics/services. I think if you're doing that, you may as well just have folders on one branch for each environment like #5.

My proposal: have a single state file that allows you to specify a list of environments in the settings block, such as:

settings:
  environments:
    - production
    - staging
    - qa

By default, all topics/services will inherit all environments. From here, you could override it on a topic/service level, like so:

topics:
  my-test-topic:
    partitions: 6
    replication: 3
    environments:
      - staging
      - qa

Then, you would be able to run kafka-gitops -f state.yaml --environment qa or something similar to plan/apply for a specific environment. This proposal allows users to choose to do one state file with environments defined inside of it or stick with separate state files per environment. I like this because it's built-in and fairly clear, and reduces code duplication by defaulting to all environments.
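As a rough illustration of how such a filter could behave, the sketch below keeps every topic/service whose environments list contains the requested environment and treats a missing list as "all declared environments". It works on an already-parsed state file; `select_environment` and the `orders` topic are hypothetical, and the `--environment` flag is only the proposal above, not an existing option.

```python
from typing import Any, Dict


def select_environment(state: Dict[str, Any], environment: str) -> Dict[str, Any]:
    """Return only the topics/services that apply to one environment."""
    declared = state.get("settings", {}).get("environments", [])
    if environment not in declared:
        raise ValueError(f"unknown environment: {environment}")
    result: Dict[str, Any] = {}
    for section in ("topics", "services"):
        kept = {}
        for name, entry in state.get(section, {}).items():
            # Entries without their own list inherit every declared environment.
            if environment in entry.get("environments", declared):
                kept[name] = {k: v for k, v in entry.items() if k != "environments"}
        result[section] = kept
    return result


# Parsed form of the example above, plus one hypothetical unscoped topic.
state = {
    "settings": {"environments": ["production", "staging", "qa"]},
    "topics": {
        "my-test-topic": {
            "partitions": 6,
            "replication": 3,
            "environments": ["staging", "qa"],
        },
        "orders": {"partitions": 6, "replication": 3},
    },
}

print(sorted(select_environment(state, "qa")["topics"]))
print(sorted(select_environment(state, "production")["topics"]))
```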

What are your thoughts on this?

Aug 11 '20 03:08 devshawn

Thanks for responding to my overly detailed issue description in kind! 😅

I mostly agree with your thoughts on the options I presented, and concede that 5 is probably the best current solution. It's clear, simple to implement, and doesn't hide any essential complexity, which together likely more than make up for the little bit of duplication required. (My DRY-sensor is way overtuned anyways.)

I like your proposal on the surface. Listing valid environments and explicitly filtering entries by which environment they apply to seems like a good solution, and would be perfect for topics. It may be messy to manage a large number of differences, but then you could still fall back to separate files, which would be preferable in that case anyway.

But how might one handle different service definitions per environment with this strategy? Some ideas:

1. Allow the service name to be declared twice. (Assuming your yaml parser even exposes both entries...) What happens if the environment lists overlap between two entries with the same service name?

   How should the following be interpreted given --environment dev?
   - Merge: my-service consumes both topics.
   - Override: my-service consumes only my-other-topic.
   - Error: services with the same name must have mutually exclusive environment lists. (my preference)
```
services:
  my-service:
    # unspecified; assume it applies to all environments (?)
    consumes:
      - my-test-topic

  my-service:
    environments:
      - dev
    consumes:
      - my-other-topic
```
2. Require service entries for different environments to be named differently somehow? (Then we'd need a way to override all the properties that are derived from the service name, like in kafka connect.)
3. Only one entry per service, but allow filtering per topic in the consumes/produces lists somehow? (This wouldn't allow other settings to differ, such as principal.)

From these, my preference is 1 with a rule that no two entries with the same name can apply to the same environment. Maybe entries with no defined environment are overridden?
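A minimal sketch of that validation rule, under some assumptions: entries are modelled as a list of (name, entry) pairs because a parsed YAML mapping generally cannot hold the same key twice, and an entry without an environments list is taken to apply to every declared environment. `find_overlaps` and the declared environment list are hypothetical.

```python
from itertools import combinations
from typing import Any, Dict, List, Tuple


def find_overlaps(
    entries: List[Tuple[str, Dict[str, Any]]], declared: List[str]
) -> List[Tuple[str, List[str]]]:
    """Report same-named entries whose environment lists share a member."""
    by_name: Dict[str, List[set]] = {}
    for name, entry in entries:
        # No explicit list means "all declared environments".
        envs = set(entry.get("environments", declared))
        by_name.setdefault(name, []).append(envs)
    problems = []
    for name, env_sets in by_name.items():
        for first, second in combinations(env_sets, 2):
            shared = sorted(first & second)
            if shared:
                problems.append((name, shared))
    return problems


declared = ["dev", "qa"]  # assumed list of valid environments
entries = [
    ("my-service", {"consumes": ["my-test-topic"]}),
    ("my-service", {"environments": ["dev"], "consumes": ["my-other-topic"]}),
]

print(find_overlaps(entries, declared))  # [('my-service', ['dev'])]
```

Under this reading the example above is rejected, because the unqualified entry already covers dev.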

If you decide to go this way at all, I think it would be very good if there were a "render" subcommand that interprets the options and writes the filtered / combined state file back as yaml to stdout, for closer inspection of how kafka-gitops interprets the state file given the selected options. This would be very useful as a debugging / validation / audit tool.

What do you think?

Aug 11 '20 23:08 infogulch

Great thoughts, thank you for giving my proposal a great once-over.

Services being different per environment didn't really cross my mind -- I thought some services might be per-environment, but not differ -- is that a valid use case you've encountered?

Great list of strategies here -- I think my preference would be number 1 as well. Here are my thoughts on the others:

Number 2: I personally would not like this approach -- complex to code and harder to understand (do I prefix with dev-, suffix with -dev, do we make it configurable, etc).
Number 3: This suggestion is a bit better but would require quite a bit of refactoring, ideas on this below...

For suggestion 1, the YAML parser does allow this I believe, so it would be possible. We'd have to then validate/investigate other properties (what happens if we duplicate the settings block?). I could see how it would also be beneficial to declare topics twice (e.g. my retention time might be 4 days in QA and 7 days in production).

My preference would also be a validation error. Merging and overriding can lead to mistakes or be confusing to understand (see any complicated Helm chart with environment-based override values files...).

- If you had one my-service service defined, it would go to the environments specified, or all environments if none are specified.
- If you had two or more my-service services defined, the tool would require them to have a unique list of environments.

My main issue with this is I personally like having a single place where something is defined. I like to have a service defined once, and all information regarding that service can be found there. In the separate folder-per-environment model, I can go into the QA folder and find my service and know I'm updating that service for the QA environment. In this model, I now go into the state file and find which service is linked to QA, and update that. Someone may update the wrong one, which is hopefully caught in a code review -- but that mistake is easier to make than when you have separate files, in my opinion.

My other thought would be to do something like this, but I think this also just gets overly complex. It has the problems of number 3, where you wouldn't be able to override other properties like principal.

services:
  my-service:
    environments:
      - dev
      - qa
    consumes:
      - dev-and-qa-only-topic
    settings:
      dev:
        consumes:
          - dev-only-topic

This does bring up the thought around topics, which have the same issue. If I have a small 2-node cluster in dev and a 3-node cluster in production, I need to specify different replication factors. What's the best way to handle that? Define the topics twice, or do I do something like the above? I'd probably lean more towards environment-based overrides:

topics:
  test-topic:
    partitions: 6
    replication: 3
    settings:
      dev:
        replication: 2

This leads me to think, how complex would this state file get? It may be easier to just keep them as separate files at that point instead of having one large one (though you can currently break services/topics into their own files).
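For reference, resolving such an override could look roughly like the sketch below. It assumes the per-environment block under settings simply replaces top-level values of the entry; `resolve_topic` is a hypothetical helper, not an existing kafka-gitops function.

```python
from typing import Any, Dict


def resolve_topic(entry: Dict[str, Any], environment: str) -> Dict[str, Any]:
    """Apply the per-environment override block, if any, to a topic entry."""
    overrides = entry.get("settings", {}).get(environment, {})
    # Start from the base values, dropping the override block itself.
    resolved = {key: value for key, value in entry.items() if key != "settings"}
    resolved.update(overrides)
    return resolved


test_topic = {
    "partitions": 6,
    "replication": 3,
    "settings": {"dev": {"replication": 2}},
}

print(resolve_topic(test_topic, "dev"))         # {'partitions': 6, 'replication': 2}
print(resolve_topic(test_topic, "production"))  # {'partitions': 6, 'replication': 3}
```

Replacing values works for scalars like replication. A list such as consumes would need an explicit decision between replacing and appending, which is part of the complexity mentioned above.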

I would be interested to see how other similar tools/projects have handled this in the past and see if those solutions fall in line with what we've come up with.

It's a lot to think about -- sorry for rambling on 😅 . I do completely agree if we implement the ability to handle environment-specific state, we should add a render command to output what the YAML would look like for a given environment.

Aug 13 '20 00:08 devshawn

It does bring me back to thinking that your original number 2 suggestion may be good -- allow a base state file and an overrides file. Helm has proven it works decently well. I would think about it less as "merge two state files" and more as "one base state file, with an optional override file" (with the override file being environment-specific or however the end-user chooses). I think both options have their pros and cons.

Even before you opened this issue, I was thinking about adding a diff/compare command to print the differences between two state files since that can help "validate" your two different environment files are not missing something critical.

Aug 13 '20 00:08 devshawn

Both of your examples, which put environment-specific overrides underneath the main object declaration, rub me the wrong way, though I'm not sure I can articulate why. Maybe because it makes interpreting the block an exercise in applying the merging rules in your head.

So I think this means I prefer replacing the service/topic entry entirely or not at all. Each entry is a self-contained config, and which entry is applied is purely a function of which environment you specify. There are no finer-grained merging/interpreting rules.

I like the explicit environments idea as well though. Perhaps a combination of my original No 2 with explicit environments?

Proposal

1. Valid environments for the file are listed at the top of the file.
```
settings:
  environments:
    - staging
    - qa
```
2. Any 2nd level entry listing (a service, a topic, etc) may be tagged as only applying to specific environments.
```
services:
  my-service:
    environments:
      - staging
    consumes:
      - my-test-topic
```
2a. Omitting the environments list for an entry is interpreted as applying to all environments.

3. There may be more than one entry with the same name, as long as the listed environments do not overlap.
```
services:
  # my-service for staging from rule 2

  my-service:
    environments:
      - qa
    consumes:
      - my-other-topic
```
3a. Along with 2a, this means that any entry without an environments qualifier can never be overridden.

4. A state file may declare a 'default' environment list in settings that applies to all the entries in the file implicitly.

These two state files are equivalent (the second one starts at the repeated settings block):

settings:
  environments:
    - qa

services:
  my-service:
    environments:
      - qa
    consumes:
      - my-test-topic

  my-producer-service:
    environments:
      - qa
    produces:
      - my-test-topic

settings:
  environments:
    - qa
  default-environments:    # (sp?)
    - qa

services:
  my-service:
    consumes:
      - my-test-topic

  my-producer-service:
    produces:
      - my-test-topic

5. Every state file must be self-contained and not reference any resources defined outside its scope.

5a. E.g. no service may reference a topic created in another state file.

6. You may pass -f [filename].yaml multiple times to combine entries from multiple files. This is implemented as:

6a. Apply any settings.default-environments block to all entries within each file.
6b. Validate that each provided state file complies with the rules individually.
6c. Concatenate the transformed entries from all files together.
6d. Only retain the settings block from the first file. (Not too sure about this one.)
6e. Validate that the concatenated result still complies with rules 1-5.

7. You may pass --environment [name] to filter out all environment-scoped entries that do not specify the named environment.

7a. Omitting --environment entirely leaves only the entries with no explicit environment specified.
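The combine and filter steps can be sketched roughly as follows. This is one interpretation, not a definitive implementation: each file is an already-parsed dictionary (so duplicate names only arise across files), rule 5 is not checked, and `flatten`, `check_no_overlap` and `combine` are hypothetical names.

```python
from typing import Any, Dict, List, Optional


def flatten(state: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Step 6a: list a file's entries with default-environments applied."""
    settings = state.get("settings", {})
    declared = set(settings.get("environments", []))
    default = settings.get("default-environments")
    entries = []
    for kind in ("topics", "services"):
        for name, body in state.get(kind, {}).items():
            envs = body.get("environments", default)
            if envs is not None and not set(envs) <= declared:
                raise ValueError(f"{kind}/{name} uses an undeclared environment")
            entries.append({"kind": kind, "name": name, "environments": envs, "body": body})
    return entries


def check_no_overlap(entries: List[Dict[str, Any]]) -> None:
    """Rules 3 and 3a: same-named entries need disjoint environment lists."""
    seen: Dict[Any, List[Any]] = {}
    for entry in entries:
        key = (entry["kind"], entry["name"])
        envs = entry["environments"]
        for other in seen.get(key, []):
            # An unscoped entry (None) applies everywhere, so it always clashes.
            if envs is None or other is None or set(envs) & set(other):
                raise ValueError(f"overlapping definitions of {key[0]}/{key[1]}")
        seen.setdefault(key, []).append(envs)


def combine(states: List[Dict[str, Any]], environment: Optional[str] = None) -> Dict[str, Any]:
    entries: List[Dict[str, Any]] = []
    for state in states:
        per_file = flatten(state)    # 6a
        check_no_overlap(per_file)   # 6b
        entries.extend(per_file)     # 6c
    check_no_overlap(entries)        # 6e
    # 6d: only the first file's settings block is kept.
    result: Dict[str, Any] = {"settings": states[0].get("settings", {}), "topics": {}, "services": {}}
    for entry in entries:
        envs = entry["environments"]
        # 7: keep unscoped entries and those naming the environment.
        # 7a: with no environment given, only unscoped entries remain.
        if envs is None or environment in envs:
            body = {k: v for k, v in entry["body"].items() if k != "environments"}
            result[entry["kind"]][entry["name"]] = body
    return result


settings = {"environments": ["staging", "qa"]}
central = {"settings": settings, "topics": {"my-test-topic": {"partitions": 6, "replication": 3}}}
staging_file = {
    "settings": {**settings, "default-environments": ["staging"]},
    "services": {"my-service": {"consumes": ["my-test-topic"]}},
}
qa_file = {
    "settings": {**settings, "default-environments": ["qa"]},
    "services": {"my-service": {"consumes": ["my-other-topic"]}},
}

print(combine([central, staging_file, qa_file], "qa")["services"])
print(combine([central, staging_file, qa_file])["services"])
```

The printed result is also roughly what a "render" subcommand would output for the selected options.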

Thoughts

(2a/3a) You can create a global state file that always applies to every environment and cannot be overridden. Nice for security.
(7) A stabilized topic/service can graduate to not specifying any environment, locking its definition down to be equivalent across all environments. I.e. you could choose to use --environment production for production, or no explicit environment for everything including production.
(6) You can choose whether to maintain differences as environment qualifiers in a single file, or separate files, and there is a straightforward way to convert between the two different representations.
- (6) Small environment differences can be maintained in a single state file for small deployments.
- (6) As environment differences grow, an org can graduate to using separate state files per environment.
  - Separate state files enable using repo features such as required reviewers when specified paths are modified etc
(5) Services and topics that don't overlap at all can be maintained in different files (see: #24) and just merge gracefully.
- This would mean that teams that do not depend on each other can operate independently.
- Teams that do depend on a shared resource, a topic for example, must coordinate access to the resource in a central state file.
(5) Requiring each file to be self-contained enables a team to use only its own state file for managing team-local dev environments. E.g. my team only uses these 10 topics; only deploy those 10 for testing on my local machine, not the other 2000 topics that may be relevant to teams in the rest of the org.
(2/3/5) Differences/overrides are big, obvious, and 'chunky'. If you need some state to be different, you must duplicate the whole topic or service and ensure their environments lists are mutually exclusive.

Your thoughts?

Aug 13 '20 19:08 infogulch

After thinking about rule No 5, I'm not sure if this is the right way to solve dependency problems. E.g. how would a service consume from a 'shared' topic defined in a central state file and produce to a 'private' topic defined in a team's state file? I think we'll need to relax or eliminate 5 to avoid this.

Aug 13 '20 20:08 infogulch

Hey @infogulch, thanks for the list of thoughts! Just wanted to let you know I haven't forgotten about this -- just have been super busy lately. I'll hopefully respond with some thoughts later this week. 😄

Sep 09 '20 19:09 devshawn

@devshawn how about a microservice-like approach where each microservice is responsible for creating its own topics? Holding a single state file for a whole cluster is not viable; typically the publisher service on a topic is responsible for creating it with the right configuration. Terraform solves this by potentially using a separate state for each cluster; what about kafka-gitops? One solution could be to store the state under a name in a compacted topic (like kafka-connect does).

Nov 22 '20 21:11 edmondop

@edmondo1984 I like where you're going with this; we manage RabbitMQ in similar ways (where queue bindings are created by the consumers). I think this would scale well for the topic portion of kafka-gitops. Using a compacted topic for storing the state of kafka-gitops is a great idea.

However, a big portion of kafka-gitops is used for managing services and ACLs. Not everyone is using it for managing ACLs, but I'm not sure how you manage the security of "who can add what ACLs" if each service was to define its own ACLs.

Nov 23 '20 15:11 devshawn

@devshawn think about what happens when you run multiple microservices backed by a single physical database server. Each microservice should be able:

- to create a new database
- to create different users for their database (one to run schema changes, one that will be used by the app, and some read-only users let's say for a kafka connect instance polling via a jdbc connector)
- to apply migrations to their database

In Kafka there is an additional challenge, as you mentioned: a topic actually works as an API, so the application that "owns" the topic (the producer) should also define who can consume it. I am not an expert in the Kafka ACL mechanism, but if it implements a decent RBAC architecture, then the solution should be the following:

- The producer should define the topics they need.
- The producer should define roles, e.g. App1Topic1ConsumerRole, App1Topic2ConsumerRole, App1ConsumerRole.
- You need to have central governance over who creates users and attaches roles to them. This is the part that you still need to have centralized.

By the way, the current constraint of one state file for one cluster is limiting for us, so we are looking into using Terraform instead: https://github.com/Mongey/terraform-provider-kafka . Terraform stores its state in a number of potential stores (the equivalent of the kafka topic itself to store the state could be local, s3, the hashicorp enterprise solution, etc) and you can import the resources you need into the current state. We use Terraform successfully to manage a large number of resources.

Nov 23 '20 16:11 edmondop
