critical icon indicating copy to clipboard operation
critical copied to clipboard

Infrastructure Monitoring As Code

= Critical Monitoring should be a layer in the stack, not an application.

= Installing

gem install rspec

git clone git://github.com/danielsdeleo/critical.git

= Manifesto Critical is my take on network/infrastructure monitoring. Here are the big ideas:

  • Infrastructure as code: The monitoring system should be an internal DSL so it can natively interact with any part of your infrastructure you can find or write a library for. You should also be able to productively alter its guts if you need to. This is a monitoring system for ops people who write code and coders who do ops.
  • Client-based: This scales better, and is actually easier to configure if you use configuration management, which you should be doing anyway.
  • Continuous verification: Critical has a single shot mode in addition to the typical daemonized operation. This allows you to verify the configuration on a host after making changes and then continuously monitor the state of the system using the same verification tests.
  • Declarative: Declare what the state of your system is supposed to be.
  • Alerting and Trending together: a client/agent can do both of these at the same time with less configuration overhead. It makes sense to keep them separate on the server side.
  • Licensing: "Do what thou wilt shall be the whole of the law," except for patent trolls, etc. So, Apache 2.0 it is.

= Design Critical runs as a cluster of daemons. The master process does the scheduling and assigns tasks to workers by communicating over a UNIX domain socket. The workers listen to the socket and process tasks as they come. I had also considered an evented architecture (using eventmachine), but that had the drawback of requiring users to write plugins using only EM-based libraries or risk running into problems with blocking IO.

== Metric DSL Critical provides a DSL for writing metric gathering code. It looks like this:

Metric(:memory_utilization) do case RUBY_PLATFORM when /darwin/ # omitted...

when /linux/

  collects 'free -b'

  reports(:total_memory_in_kb => :int) do
    result.line(1).split[1]
  end

  reports(:bytes_used => :int) do
    result.line(2).split[2]
  end

else
  raise UnsupportedPlatform, "memory_utilization does not have an implementation for your platform yet :("
end

reports(:kb_free => :integer) do
  bytes_free / 1024
end

reports(:kb_used => :integer) do
  pp :kb_used => (bytes_used / 1024)
  bytes_used / 1024
end

reports(:mb_free => :float) do
  kb_free / 1024.0
end

reports(:mb_used => :float) do
  kb_used.to_f / 4.0
end

end

== Using Metrics To configure critical to monitor your metrics, you use the monitor DSL:

require_metric 'disk_utilization' require_metric 'memory_utilization' require_metric 'cpu_utilization' require_metric 'cluster'

Monitors are also where you define your scheduling.

Monitor(:system) do

# Monitor statements can be nested, this nesting will be included in the
# collected data for tracking/tagging purposes.
Monitor(hostname) do # includes the hostname in the namespace

  # Specify collection intervals with +every+ or +collect_every+
  # The +every+ form takes a block, each monitor you define inside the block
  # will be scheduled to run at that interval.
  every(10=>:seconds) do

    disk_utilization('/') { track :percentage }

    memory_utilization { track :bytes_used }

    cpu_utilization {track :percent_used}

    cluster("critical : worker") do |c|
      c.track :processes
      c.track :total_cpu
      c.track :total_rss
      c.track :uptime
    end

  end
end

end

= Running Critical: See bin/critical --help and the examples directory

== Project Status Initial work focused on the alerting half of the alerting/trending combo that comprises "monitoring." I've pivoted and am currently focusing on making it dead simple to get data into graphite. Alerting is still a long term priority.

= License and Copyright Distributed under the terms of the Apache 2.0 license. (c) 2010,2011 Daniel DeLeo