
Why not use a config file?

Open sc0rp10 opened this issue 9 years ago • 16 comments

Hi. I want to use consul-alerts, but I'm totally discouraged: I can't store the service configuration in a config file. I need automated deployment of my infrastructure services via chef/ansible/puppet/salt/etc. Right now, after deploying, I still have to configure thresholds, notifiers, etc. Why not use a config file for these settings? It would be nice if I could store, edit, and version the configuration of consul-alerts like that of any other service.

sc0rp10 avatar Feb 18 '16 02:02 sc0rp10

consul-alerts is built on Consul and uses its key/value store for failover and high reliability. Using the K/V store ensures that the config is always consistent across the cluster and ready for failover, and it is also a more robust datastore. For Chef, you can make use of this project: chef-consul-kv. It lets you add keys from a recipe, so everything you need is configured automatically. The things you might want to set that way are the profile settings for groups or nodes, and any new profiles. Many of the global settings only need to be set once and are easy to set by hand, but it may be better to set them via Chef on the systems that run consul-alerts (last I looked, the consul-alerts cookbook needed to be updated for Chef changes).

fusiondog avatar Feb 18 '16 04:02 fusiondog

Storing the configuration in Consul has some practical issues, however.

By storing the configuration in the database, you sacrifice reproducibility and auditability of the configuration. You can use automated processes to update the configuration, but not via standard configuration management systems like Chef, Ansible, or Puppet. And a user can update the database manually, or create additional keys that the configuration script doesn't see and therefore cannot revert or delete. Configuration files, on the other hand, can represent atomic snapshots of the complete configuration, and can be stored in revision control so that one can observe its change history.

Consul also doesn't support strong auditing of the K/V store, so you cannot observe who changed which aspect of the configuration and when.

mfischer-zd avatar Mar 15 '16 15:03 mfischer-zd

I agree with @mfischer-zd that configuring consul-alerts via the Consul K/V store can cause issues in practice.

However I would say that Ansible can be used to effectively automate the management of consul-alerts - my team does it. You can store configuration for notifiers etc. in your inventory or roles and have playbook tasks that use the HTTP API to the K/V store to push the configuration in. And of course we keep our Ansible playbooks and roles in revision control so we have change history and all that.
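The push step described above can be sketched in a few lines of Python. This is a minimal illustration, not the playbook the team actually uses: it assumes a Consul agent reachable at `http://127.0.0.1:8500`, relies only on the standard `PUT /v1/kv/<key>` endpoint of Consul's HTTP API, and the helper names `flatten` and `push` are hypothetical. How consul-alerts expects each value to be encoded is also an assumption here; check its README before relying on it.

```python
import json
import urllib.request


def flatten(config, prefix):
    """Turn a nested dict into {"prefix/a/b": "value"} pairs for the K/V store."""
    pairs = {}
    for name, value in config.items():
        key = f"{prefix}/{name}"
        if isinstance(value, dict):
            pairs.update(flatten(value, key))
        else:
            # Assumption: strings are stored as-is, everything else as JSON.
            pairs[key] = value if isinstance(value, str) else json.dumps(value)
    return pairs


def push(pairs, consul_url="http://127.0.0.1:8500"):
    """Write every pair through Consul's K/V HTTP API (PUT /v1/kv/<key>)."""
    for key, value in sorted(pairs.items()):
        request = urllib.request.Request(
            f"{consul_url}/v1/kv/{key}", data=value.encode("utf-8"), method="PUT"
        )
        with urllib.request.urlopen(request) as response:
            response.read()
```

A version-controlled file would then be applied with something like `push(flatten(json.load(open("alerts.json")), "consul-alerts/config"))`. Note that this only writes keys; it never deletes keys that have disappeared from the file.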

And we have found some advantages to driving the dynamic configuration of consul-alerts behaviour from the K/V store. For example, we have Ansible playbooks that bring down services for upgrade/reconfigure, do rolling reboots of clusters, etc. To avoid the storm of alerts about services going down, we have the playbooks do an initial task to blacklist the service/host going down and then remove the blacklist once the service/host comes back up.
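The blacklist-then-restore pattern can be sketched as a Python context manager. This is an illustration under stated assumptions, not the playbook described above: the `consul-alerts/config/checks/blacklist/...` key layout is assumed from memory of the consul-alerts README and should be verified, `silenced` is a hypothetical helper, and `kv` stands for any dict-like wrapper around the Consul K/V store (a plain dict is enough to try it out).

```python
from contextlib import contextmanager

# Assumed key layout; verify against the consul-alerts documentation.
BLACKLIST_PREFIX = "consul-alerts/config/checks/blacklist"


@contextmanager
def silenced(kv, kind, name):
    """Blacklist a node or service for the duration of a maintenance step.

    kind is "nodes" or "services"; kv is any mutable mapping over the K/V store.
    """
    key = f"{BLACKLIST_PREFIX}/{kind}/{name}"
    kv[key] = ""  # the presence of the key is what matters in this sketch
    try:
        yield key
    finally:
        # Always lift the blacklist, even if the maintenance step failed.
        kv.pop(key, None)
```

Usage would look like `with silenced(kv, "nodes", "web-01"): reboot("web-01")`, where `reboot` is whatever the maintenance step is. Removing the key in `finally` is a design choice: if you would rather stay silent until a failed host is confirmed healthy, remove the key only on success.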

matthewlowry avatar Mar 17 '16 04:03 matthewlowry

If you are concerned about changes to your config from unwanted sources you could make use of the ACLs to prevent unwanted changes. https://www.consul.io/docs/internals/acl.html

The ACLs can be set up via standard config management tools, for example with Chef cookbooks. https://github.com/johnbellone/consul-cookbook

There are also consul_kv and consul alert cookbooks that could be used to load the config. https://github.com/dpetzel/consul_kv_cookbook https://github.com/dpetzel/consul_alerts-cookbook

If you wish, you could write an init script that uses the consulate command to load a JSON config file into Consul on startup. https://github.com/gmr/consulate

Granted consul does not audit changes, but with ACL in place it isn't really an issue. Also most file systems do not audit file changes by default either. So this would not be a feature inherent to using a config file over consul kv. However if auditing is very important to you, it may be possible that a solution using Vault on Consul could be devised. https://www.vaultproject.io/intro/vs/consul.html https://www.vaultproject.io/intro/index.html

fusiondog avatar Mar 17 '16 15:03 fusiondog

Auditing is not identical to ACLs. And configuration file histories are significantly easier to audit and store in change control than database manipulations.

I'm not suggesting that configurations stored in Consul not be supported for those who like to do it that way. But those who wish to store their configurations in files instead have good reason to, and it should be supported as well.

mfischer-zd avatar Mar 24 '16 20:03 mfischer-zd

Storing the running config in consul is necessary. It cannot be worked around. If the individual instances just store config in local memory they could be inconsistent. So however you store your config it needs to be loaded into consul at run time.

I understand you wish to have a config to version and store. I was trying to address that and provide you with options in my previous posts. There are multiple ways already listed in this ticket.

  1. Use chef or ansible and store your recipes/config in version control.
  2. Create a JSON config file that you version and load on init with consulate.

fusiondog avatar Mar 25 '16 02:03 fusiondog

> Storing the running config in consul is necessary. It cannot be worked around

Respectfully, I think you mean "I don't want to," not that you can't do it. If you really cannot do it, it'd be helpful to discuss the technical reasons why.

With respect to the consistency argument, how many instances of this are supposed to be running simultaneously? If the answer is "none" (you don't want multiple instances alerting simultaneously, as it would issue duplicate notifications), then strict consistency is unnecessary.

We've already discussed why the alternatives you proposed are unpalatable to us, including the complexity of loading it, and the possibility that the configuration made through a series of K/V inserts is not identical to an atomic flush-and-restore operation (which would leave a running system without configuration for a time). Loading even a complete dump into a database is not an adequate substitute for having a full and complete configuration file that is accessible on the filesystem.

mfischer-zd avatar Mar 25 '16 02:03 mfischer-zd

Actually, consul-alerts is designed to run an instance on each consul server ( not every consul agent ). The instances make use of consul leader election to determine a master instance. In the event of failure of the master instance consul-alerts will immediately fail over to another instance via consul election. For this feature to function correctly it is REQUIRED that config be consistent. As this tool is built around consul, it makes complete sense to use consul's innate features to accomplish this.

Doing otherwise would completely undermine and invalidate this core design.

I understand that this is a different paradigm and I do appreciate that it presents you with challenges. But I do not feel it is an issue of preference, but rather a fundamental technical issue which I hope I've now clarified.

I am open to suggestions, but barring a complete redesign and even a complete reconception of even design goals, this isn't an option.

fusiondog avatar Mar 25 '16 03:03 fusiondog

First, the documentation doesn't say anything about automated failover, or where it must run (and I wouldn't want to run this on a Consul server anyway since alerting is a separate concern and need not share host resources).

At any rate, it is entirely possible to have N+1 functionality using Consul's mutex functionality (e.g. wrapping the process with consul lock) while using replicated configuration files. We know because we already do it ourselves with other software. The use of the mutex doesn't necessarily imply that one must also use the K/V store for configuration.

Look, I get why you want to use it - for you and others it can be extremely convenient. Nor am I suggesting you abandon using K/V for configuration. But I find your claims of necessity unconvincing, and there are legitimate reasons for people to want file-based configuration.

mfischer-zd avatar Mar 25 '16 03:03 mfischer-zd

The document could be improved on that point. I know much of this from documentation in the code and the code itself that I read when I recently began contributing to this project.

It isn't required that it run on the consul servers, though in many installations it makes sense to consolidate functions. If you find that load exceeds acceptable levels on your consul servers you could instead run on other hosts and point to either a local or remote consul agent. It is only a question of a specific deployment. It isn't mandatory that you run multiple instances either ( though you forfeit availability ). Though if you are running a single instance, atomic removal and restore using consulate is less of a concern if it is wrapped around a restart of the singleton service instance.

I don't think the choice to use the KV was based on the use of the mutex. Without the KV, the question of config replication and consistency remains an open one. It would otherwise be reliant on some external third-party code or structure that wouldn't provide the guarantees needed for predictable behavior during failover. Using chef/ansible/etc. relies on scheduled polling that could leave configs inconsistent for long periods of time. Not using the KV makes me think of phrases like: "Don't look a gift horse in the mouth" or "Don't reinvent the wheel". Consul supplies a ready solution to these issues that is guaranteed to be available to consul-alerts.

I concur that atomic flush and restore remains problematic in a cluster scenario, but I do not feel it is a consul-alerts specific issue and is a generic issue applicable to many possible cases in consul dependent tools. I think it would be best to add a flag to consulate restore that first reads the tree at the target KV directory into memory, then restores from the JSON config file, then compares tables and prunes missing. I took a quick look at the consulate code and I don't think it should be too difficult to implement.
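The compare-and-prune idea can be sketched independently of consulate. The following is a minimal illustration of the logic only, not consulate's actual implementation: `plan_restore` is a hypothetical helper, and both arguments are plain dicts mapping full key paths to string values (the live tree under the target directory and the contents of the JSON file).

```python
def plan_restore(current, desired):
    """Work out what a pruning restore has to do.

    current: keys and values read from the target K/V directory.
    desired: keys and values loaded from the JSON config file.
    Returns (writes, deletes): pairs to write, then stale keys to remove.
    """
    # Only write keys that are new or whose value differs.
    writes = {k: v for k, v in desired.items() if current.get(k) != v}
    # Anything in the live tree that the file does not mention is stale.
    deletes = sorted(k for k in current if k not in desired)
    return writes, deletes
```

Applying the writes before the deletes means the tree is never empty during the restore, which avoids the flush-and-restore gap, although the tree can still briefly hold a mix of old and new keys.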

fusiondog avatar Mar 25 '16 07:03 fusiondog

Here is a closed issue that speaks a bit to the use of multiple instances: https://github.com/AcalephStorage/consul-alerts/issues/92

fusiondog avatar Mar 25 '16 09:03 fusiondog

> Not using the KV makes me think of phrases like: "Don't look a gift horse in the mouth" or "Don't reinvent the wheel". Consul supplies a ready solution to these issues that is guaranteed to be available to consul-alerts.

Insisting on using solely the KV store, on the other hand, reminds me of the phrase "when you have a hammer, everything looks like your thumb." :)

> I concur that atomic flush and restore remains problematic in a cluster scenario, but I do not feel it is a consul-alerts specific issue and is a generic issue applicable to many possible cases in consul dependent tools.

That simply tells us how widespread the problem is!

> I think it would be best to add a flag to consulate restore that first reads the tree at the target KV directory into memory, then restores from the JSON config file, then compares tables and prunes missing. I took a quick look at the consulate code and I don't think it should be too difficult to implement.

I suspect that this will be subject to a race condition, unless there's a way to signal to consul-alerts to ignore changes in the configuration store until the load and cleanup work has finished. (There's nothing in the documentation about signaling reloads, BTW.)

mfischer-zd avatar Mar 25 '16 14:03 mfischer-zd

Hammer or wheel, it is getting the job done. At this point no other option has been presented that achieves the design goals or supports the features. When I first took a look at the code base I was also put off by the way config was loaded. I'm not making a religious argument here. After reviewing the details of the code, it became clear that using the KV was the most straightforward means to achieve the goals of consistency and availability. Anything else would add unneeded layers of complexity. If I had not reviewed this before deciding to commit my efforts, I would not have bothered. One would need to present a detailed alternative proposal that meets the goals of the project to make a strong case otherwise.

I feel that the idea that loading into Consul is a problem is a subjective viewpoint. Consul is not simply designed to run checks and replace Nagios; it is a tool that tackles ongoing issues in the development of cluster computing and shared memory (see https://www.consul.io/intro/vs/zookeeper.html). 99.99% of applications load their config files into memory before acting on any of the data. Memory is a local key (memory address) / value (physical memory block) store that is made available to your applications. All of the same concerns exist locally, but they are simply givens of the current paradigm, existing at a lower level than one usually concerns oneself with. If you shift perspective to the cluster application level, then these issues must be addressed in new ways. Consul is built to deal with these issues. I believe it is important and productive that you have been involved in bringing these questions to light, and you share glory with the likes of Zeno of Elea, but I honestly don't believe I am being the least bit hyperbolic in saying that removing the Consul KV would basically make this an entirely different project, require completely redefining the goals, and most likely require a nearly complete rewrite.

With regard to the race condition on the proposed restore-and-prune method: there would indeed be a window for a race condition during config load. This window would most likely be in the range of a few microseconds at best and a few seconds in the worst case. Even with no fix, this is a vast improvement on the window that would be created during changes made by external config management such as Chef. Chef would need to measure its window of inconsistency in minutes in the best case (which assumes you are running Chef constantly with no break and the Chef run ONLY takes 1-2 minutes) and in hours in the worst case (it is not uncommon for Chef runs to be scheduled 12 hours apart or longer, nor is it too uncommon for Chef runs to last into the tens of minutes).

However, it occurs to me that the best remediation is to add the ability to pause notifications in consul-alerts. This could be implemented with a key whose value stores one of 3 states (pause, unpause, replay) of the pause feature, plus a queue in Consul for paused notifications. In paused mode, notifications would be placed in the queue and all other notification logic bypassed (this is already done for service instances that are not in leader mode, so this should be trivial). When you are ready to process notifications again, the state is changed to "replay". Then, on the next notification incident and/or an API trigger, all queued notifications are processed and removed from the queue. Once the queue is empty, the state is returned to unpaused and operation continues normally.
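The proposed pause feature amounts to a small state machine, sketched below in Python. This only illustrates the proposal and is not code from consul-alerts: the class `PausableNotifier` is hypothetical, the state and the queue live in memory here rather than in the Consul KV store, and `send` stands for whatever actually delivers a notification.

```python
from collections import deque

STATES = ("pause", "unpause", "replay")


class PausableNotifier:
    def __init__(self, send):
        self.send = send          # callable that delivers one notification
        self.state = "unpause"    # would be a K/V key in the real proposal
        self.queue = deque()      # would be a queue stored in Consul

    def set_state(self, state):
        if state not in STATES:
            raise ValueError(f"unknown state: {state}")
        self.state = state

    def notify(self, message):
        if self.state == "unpause":
            self.send(message)
            return
        # Paused or replaying: queue first so ordering is preserved.
        self.queue.append(message)
        if self.state == "replay":
            self.drain()

    def drain(self):
        """Send everything queued, oldest first, then resume normal operation."""
        while self.queue:
            self.send(self.queue.popleft())
        self.state = "unpause"
```

A config reload would then be wrapped as `set_state("pause")`, restore, `set_state("replay")`. Calling `drain()` directly plays the role of the API trigger mentioned above, for cases where no new notification arrives to start the replay.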

I feel at this point in the discussion there have been established some very useful action items that we can take away:

  1. Improved documentation on how consul-alerts works, its goals, its design, and its core use cases.
  2. Improved consulate restore feature that supports pruning. This provides the most use to the Consul community with a single change.
  3. Documentation on how one could use the tools to load consul-alerts config from a JSON config file, perhaps going so far as adding a wrapper shell or init script to the repository.
  4. Addressing the race condition related to config loading/editing by adding a generic "pause" feature. The pause feature may prove useful to other users in other situations.

fusiondog avatar Mar 25 '16 20:03 fusiondog

I've added the features to consulate and they have been accepted upstream. The change isn't in the default pip version yet. If you would like to test or use the features in the meantime, you may install the tool like so:

    pip install --upgrade git+https://github.com/gmr/consulate.git

fusiondog avatar Mar 30 '16 08:03 fusiondog

This requires python-pip and git to be installed, of course. Example usage:

    consulate kv backup consul-alerts/config -f tmp.json
    consulate kv restore consul-alerts/config -f tmp.json --prune

fusiondog avatar Mar 30 '16 09:03 fusiondog

@fusiondog I'm also coming here trying to use consul-alerts in a fully declarative environment (i.e. not using a "task runner" scripting system like Ansible).

Adding support for external tools like consulate already makes things simpler.

But I think the simplest solution would be to add an optional command-line flag to consul-alerts that takes a JSON file and stores it in the Consul KV store, unconditionally, at startup.

E.g. consul-alerts --write-config config.json.

nh2 avatar Jun 24 '18 17:06 nh2