satellite Not all slaves appear on whitelist

I have 2 Mesos clusters. The first has 3 servers, each running ZK, a Mesos master, a Mesos slave, a Satellite master and a Satellite slave. The second has 6 servers, with the masters and slaves on separate machines. On both clusters, all 3 Mesos slaves are active.

At one point, on the first cluster (3 servers), I was able to see all slaves registered in the whitelist. I had a high load on 2 of the servers. Those 2 servers dropped from the whitelist. Once the load came down, one slave re-registered with the whitelist and the other didn't.

On my 2nd cluster (6 servers), 1 of the slave servers never registers.

In both cases, I see mesos/slave events from all 3 satellite slaves reaching the satellite master leader.

Cluster 1:

CentOS 6.5
kernel 3.10.75-1.el6.elrepo.x86_64
OpenJDK 1.7.0_51
ZooKeeper 3.4.3
Mesos 0.22.0
Satellite 0.2.0

Cluster 2:

CentOS 7.1.1503
kernel 3.10.0-229.11.1.el7.x86_64
OpenJDK 1.8.0_51
ZooKeeper 3.4.6
Mesos 0.23.0
Satellite 0.2.0

/usr/sbin/mesos-master --zk=zk://jb5.example.com:2181,jb6.example.com:2181,jb7.example.com:2181/mesos --port=5050 --log_dir=/var/log/mesos --quorum=2 --whitelist=/etc/mesos/slave_list.txt --work_dir=/var/lib/mesos

On all 3 Mesos masters, the whitelist file starts out like this:

$ cat /etc/mesos/slave_list.txt
jb8.example.com
jb9.example.com
jb10.example.com

Each Satellite master has a satellite-config.clj file that looks similar to this. The only difference is that each config points to the server it's installed on. For example, on jb6, all the jb5s are replaced with jb6.

(def settings
  (merge settings
         {:mesos-master-url (url/url "http://jb5.example.com:5050")
          :sleep-time 5000
          :zookeeper "jb5.example.com:2181"
          :local-whitelist-path "/etc/mesos/slave_list.txt"
          :riemann-tcp-server-options {:host "jb5.example.com" }
          :service-host "jb5.example.com"}))

I made no changes to the satellite-config.clj file. My slave-config.clj file looks like this:

(def mesos-work-dir "/var/lib/mesos")

(def settings
  {:satellites [{:host "jb5.example.com"}
                {:host "jb6.example.com"}
                {:host "jb7.example.com"}]
   :service "mesos/slave/"
   :comets [(satellite-slave.recipes/free-memory 50 (-> 60 t/seconds))
            (satellite-slave.recipes/free-swap   50 (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 "/tmp" (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 "/var" (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 mesos-work-dir
                                                     (-> 60 t/seconds))
            (satellite-slave.recipes/num-uninterruptable-processes 10 (-> 60 t/seconds))
            (satellite-slave.recipes/load-average 4.5 (-> 60 t/seconds))
            {:command ["echo" "17"]
             :schedule (every (-> 60 t/seconds))
             :output (fn [{:keys [out err exit]}]
                       (let [v (-> out
                                   (clojure.string/split #"\s+")
                                   first
                                   (Integer/parseInt))]
                         [{:state "ok"
                           :metric v
                           :ttl 300
                           :description "example test -- number of files/dirs in cwd"}]))}]})

So the only change I made was to add my satellite masters. If I query the whitelist API I see the following results:

{"jb10.np.ev1.yellowpages.com":
  {"managed-events":{},"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb8.np.ev1.yellowpages.com":
  {"managed-events":{},"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb9.np.ev1.yellowpages.com":
  {"managed-events":{},"manual-events":{},"managed-flag":null,"manual-flag":null}}

In my slave-config.clj, I changed the mesos-work-dir from /tmp/mesos to /var/lib/mesos. I restarted the slaves. This caused expired events for /tmp/mesos. That expired event ended up changing the whitelist and now jb9 & jb10 are 'on' and jb8 is 'off'.

{"jb10.example.com":
  {"managed-events":
    {"mesos\/slave\/percentage used of \/tmp\/mesos":
      {"time":1.444076213197E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb10.example.com"}
    },
    "manual-events":{},"managed-flag":"on","manual-flag":null
  },
"jb8.example.com":
  {"managed-events":
    {"mesos\/slave\/percentage used of \/tmp\/mesos":
      {"time":1.444076258248E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb8.example.com"}
    },
    "manual-events":{},"managed-flag":"off","manual-flag":null
  },
"jb9.example.com":
  {"managed-events":
    {"mesos\/slave\/percentage used of \/tmp\/mesos":
      {"time":1.444076198146E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb9.example.com"}
    },
    "manual-events":{},"managed-flag":"on","manual-flag":null
  }
}
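For anyone comparing responses like the two above, a small script can do the reading instead of eyeballing the JSON. This is a minimal Python sketch, not part of Satellite: the function name `summarize_whitelist` is hypothetical, and it assumes the response has the same shape as the output pasted in this thread. The sample is an abbreviated copy of the second response, with plain slashes in place of the escaped ones.

```python
import json


def summarize_whitelist(payload):
    """Summarize a whitelist response shaped like the ones in this thread.

    Returns (not_on, expired): the sorted hosts whose "managed-flag" is not
    "on" (this includes null), and a dict mapping each host to the sorted
    services whose managed event is in the "expired" state.
    """
    hosts = json.loads(payload)
    not_on = sorted(
        host for host, info in hosts.items() if info.get("managed-flag") != "on"
    )
    expired = {
        host: sorted(
            service
            for service, event in info.get("managed-events", {}).items()
            if event.get("state") == "expired"
        )
        for host, info in hosts.items()
    }
    return not_on, expired


# Abbreviated copy of the second response above (two hosts, fewer event fields).
sample = """
{"jb8.example.com":
  {"managed-events":
    {"mesos/slave/percentage used of /tmp/mesos": {"state": "expired"}},
   "manual-events": {}, "managed-flag": "off", "manual-flag": null},
 "jb9.example.com":
  {"managed-events":
    {"mesos/slave/percentage used of /tmp/mesos": {"state": "expired"}},
   "manual-events": {}, "managed-flag": "on", "manual-flag": null}}
"""

not_on, expired = summarize_whitelist(sample)
print("managed-flag not on:", not_on)
for host, services in sorted(expired.items()):
    print(host, "expired:", services)
```

Running it against the first response would flag jb9, whose managed-flag is null, while the second would flag jb8.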

And /metrics/snapshot also says 2 out of 3:

{"prop-available-hosts":0.6666666666666667,"num-available-hosts":3,"num-hosts-up":2}
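The three numbers in that snapshot are consistent with prop-available-hosts being num-hosts-up divided by num-available-hosts. That reading is inferred from this one sample, not taken from the Satellite documentation. Under that assumption, a quick Python check of a snapshot could look like the sketch below; `snapshot_is_consistent` is a hypothetical helper name.

```python
import json
import math


def snapshot_is_consistent(payload):
    """Check prop-available-hosts against num-hosts-up / num-available-hosts.

    The relationship between the three fields is an assumption based on the
    sample in this thread.
    """
    snap = json.loads(payload)
    up = snap["num-hosts-up"]
    available = snap["num-available-hosts"]
    if available == 0:
        # Avoid dividing by zero; an empty cluster cannot be checked this way.
        return False
    # Compare with a tolerance, since the ratio is a rounded float.
    return math.isclose(snap["prop-available-hosts"], up / available, rel_tol=1e-9)


sample = '{"prop-available-hosts":0.6666666666666667,"num-available-hosts":3,"num-hosts-up":2}'
print(snapshot_is_consistent(sample))
```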

Oct 05 '15 20:10 juicedM3

I just restarted my Satellite masters on cluster 2, and now all three slaves are on the whitelist. It's still showing the expired event. So far, it has taken a failed test plus a Satellite master restart for all slaves to show up on the whitelist.

Oct 05 '15 21:10 juicedM3

Thanks for the report. Can you also post what your riemann-config.clj was?

Oct 08 '15 17:10 sabraham

Np. Above, I meant to say that I made no changes to riemann-config.clj, but I wrote satellite-config.clj instead. Either way, here's my riemann-config.clj. Thanks!

;; default emailer uses localhost
(def email (mailer))
(def indx (index))
(streams
  (where (service #"satellite.*")
         prn)
  (where (service #"mesos/slave.*")
         prn
         ;; if a host/service pair changes state, update global state
         (changed-state
          (where (state "ok")
                 delete-event
                 (else
                  persist-event)))
         ;; If we stop receiving any test from a host, remove that host
         ;; from the whitelist. We don't want to send tasks to a host
         ;; that is (a) experiencing a network partition or (b) whose
         ;; tests are timing-out. If it is (c) that the satellite-slave
         ;; process is down, this at least warrants investigation.
         (where* expired?
                 (fn [event]
                   (warn "Removing host due to expired event" (:host event))
                   (off-host (:host event)))
                 ;; Otherwise make sure all tests pass on each host
                 (else
                  (coalesce 60
                            ensure-all-tests-pass))))

  ;; if less than 70% of hosts registered with mesos are
  ;; on the whitelist, alert with an email
  ;; (where (and (service #"mesos/prop-available-hosts")
              ;; (< metric 0.7))
         ;; (email "[email protected]"))
)

Oct 08 '15 18:10 juicedM3

Awesome, thanks!

Oct 08 '15 18:10 sabraham

I'm prepared to attempt to reproduce this issue. Has it happened to you again, @juicedM3? @sabraham, have there been any developments since your last comment here?

Nov 18 '15 15:11 mforsyth

Nothing substantial to report, @mforsyth

Nov 18 '15 15:11 sabraham

@mforsyth sadly, we've moved away from Satellite. Our client wanted to run a lighter-weight slave/agent (even though memory is cheap), and that cut out so many of Satellite's features. Plus, we have no one here who really knows Clojure. So if we took things apart and glued them back together, maintaining the result would be difficult for them.

Nov 18 '15 18:11 juicedM3

@juicedM3 thanks for letting us know your reasons!

Nov 18 '15 18:11 mforsyth

@mforsyth we're definitely going to keep an eye on the project. I'm hoping to improve my Clojure knowledge in the future, so there might be a point where things come back around. Thanks!

Nov 18 '15 18:11 juicedM3
