Not all slaves appear on whitelist
I have 2 Mesos clusters. The first has 3 servers, each running ZooKeeper, a Mesos master, a Mesos slave, a Satellite master, and a Satellite slave. The second has 6 servers, with masters and slaves on separate machines. On both clusters, all 3 Mesos slaves are active.
At one point, on the first cluster (3 servers), I was able to see all slaves registered in the whitelist. I had a high load on 2 of the servers. Those 2 servers dropped from the whitelist. Once the load came down, one slave re-registered with the whitelist and the other didn't.
On my 2nd cluster (6 servers), 1 of the slave servers never registers.
In both cases, I see mesos/slave events from all 3 satellite slaves reaching the satellite master leader.
Cluster 1:
- CentOS 6.5
- kernel 3.10.75-1.el6.elrepo.x86_64
- OpenJDK 1.7.0_51
- ZooKeeper 3.4.3
- Mesos 0.22.0
- Satellite 0.2.0
Cluster 2:
- CentOS 7.1.1503
- kernel 3.10.0-229.11.1.el7.x86_64
- OpenJDK 1.8.0_51
- ZooKeeper 3.4.6
- Mesos 0.23.0
- Satellite 0.2.0
/usr/sbin/mesos-master --zk=zk://jb5.example.com:2181,jb6.example.com:2181,jb7.example.com:2181/mesos --port=5050 --log_dir=/var/log/mesos --quorum=2 --whitelist=/etc/mesos/slave_list.txt --work_dir=/var/lib/mesos
On all 3 Mesos masters, the whitelist file starts out like this:
$ cat /etc/mesos/slave_list.txt
jb8.example.com
jb9.example.com
jb10.example.com
Each Satellite master has a satellite-config.clj that looks like this. The only difference is that each config points to the server it's installed on; for example, on jb6 all of the jb5 entries are replaced with jb6.
(def settings
  (merge settings
         {:mesos-master-url (url/url "http://jb5.example.com:5050")
          :sleep-time 5000
          :zookeeper "jb5.example.com:2181"
          :local-whitelist-path "/etc/mesos/slave_list.txt"
          :riemann-tcp-server-options {:host "jb5.example.com"}
          :service-host "jb5.example.com"}))
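Since the only per-host difference is the hostname, one way to avoid hand-editing each copy would be to derive it from the machine itself. This is just a sketch of that idea, not something I actually did and not part of Satellite; it assumes the hostname the JVM reports matches the names used elsewhere in these configs (jb5/jb6/jb7):

;; Hypothetical variant of the config above: derive the per-host values
;; from the local hostname instead of hand-editing each copy. Assumes the
;; canonical hostname matches the name the other configs expect.
(def this-host (.getCanonicalHostName (java.net.InetAddress/getLocalHost)))

(def settings
  (merge settings
         {:mesos-master-url (url/url (str "http://" this-host ":5050"))
          :sleep-time 5000
          :zookeeper (str this-host ":2181")
          :local-whitelist-path "/etc/mesos/slave_list.txt"
          :riemann-tcp-server-options {:host this-host}
          :service-host this-host}))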
I made no changes to the riemann-config.clj file. My slave-config.clj file looks like this:
(def mesos-work-dir "/var/lib/mesos")

(def settings
  {:satellites [{:host "jb5.example.com"}
                {:host "jb6.example.com"}
                {:host "jb7.example.com"}]
   :service "mesos/slave/"
   :comets [(satellite-slave.recipes/free-memory 50 (-> 60 t/seconds))
            (satellite-slave.recipes/free-swap 50 (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 "/tmp" (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 "/var" (-> 60 t/seconds))
            (satellite-slave.recipes/percentage-used 90 mesos-work-dir (-> 60 t/seconds))
            (satellite-slave.recipes/num-uninterruptable-processes 10 (-> 60 t/seconds))
            (satellite-slave.recipes/load-average 4.5 (-> 60 t/seconds))
            {:command ["echo" "17"]
             :schedule (every (-> 60 t/seconds))
             :output (fn [{:keys [out err exit]}]
                       (let [v (-> out
                                   (clojure.string/split #"\s+")
                                   first
                                   (Integer/parseInt))]
                         [{:state "ok"
                           :metric v
                           :ttl 300
                           :description "example test -- number of files/dirs in cwd"}]))}]})
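As an aside, a comet that actually counted files (rather than the echo placeholder) might look something like the sketch below, following the same :command / :schedule / :output contract. The find invocation and the 1000-entry threshold are made up purely for illustration:

;; Hypothetical comet (not one of the Satellite recipes): count entries
;; directly under the Mesos work dir and flag the host if there are more
;; than 1000. Same contract as the "echo 17" example above.
{:command ["find" mesos-work-dir "-maxdepth" "1"]
 :schedule (every (-> 60 t/seconds))
 :output (fn [{:keys [out err exit]}]
           (let [n (count (clojure.string/split-lines out))]
             [{:state (if (> n 1000) "critical" "ok")
               :metric n
               :ttl 300
               :description "entries directly under the Mesos work dir"}]))}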
So the only change I made to the stock slave-config.clj was to add my Satellite masters. If I query the whitelist API, I see the following results:
{"jb10.np.ev1.yellowpages.com":
{"managed-events":{},"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb8.np.ev1.yellowpages.com":
{"managed-events":{},"manual-events":{},"managed-flag":"on","manual-flag":null},
"jb9.np.ev1.yellowpages.com":
{"managed-events":{},"manual-events":{},"managed-flag":null,"manual-flag":null}}
In my slave-config.clj, I changed mesos-work-dir from /tmp/mesos to /var/lib/mesos and restarted the slaves. This caused expired events for /tmp/mesos. Those expired events ended up changing the whitelist: now jb9 & jb10 are 'on' and jb8 is 'off'.
{"jb10.example.com":
{"managed-events":
{"mesos\/slave\/percentage used of \/tmp\/mesos":
{"time":1.444076213197E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb10.example.com"}
},
"manual-events":{},"managed-flag":"on","manual-flag":null
},
"jb8.example.com":
{"managed-events":
{"mesos\/slave\/percentage used of \/tmp\/mesos":
{"time":1.444076258248E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb8.example.com"}
},
"manual-events":{},"managed-flag":"off","manual-flag":null
},
"jb9.example.com":
{"managed-events":
{"mesos\/slave\/percentage used of \/tmp\/mesos":
{"time":1.444076198146E9,"state":"expired","service":"mesos\/slave\/percentage used of \/tmp\/mesos","host":"jb9.example.com"}
},
"manual-events":{},"managed-flag":"on","manual-flag":null
}
}
And /metrics/snapshot also says 2 out of 3:
{"prop-available-hosts":0.6666666666666667,"num-available-hosts":3,"num-hosts-up":2}
I just restarted my Satellite masters on cluster 2 and now all three slaves are on the whitelist. It's still showing the expired event. So far it has taken a failed test plus a Satellite master restart for all slaves to show up on the whitelist.
Thanks for the report. Can you also post your riemann-config.clj?
No problem, here's my riemann-config.clj. Thanks!
;; default emailer uses localhost
(def email (mailer))
(def indx (index))

(streams
 (where (service #"satellite.*")
        prn)
 (where (service #"mesos/slave.*")
        prn
        ;; if a host/service pair changes state, update global state
        (changed-state
         (where (state "ok")
                delete-event
                (else
                 persist-event)))
        ;; If we stop receiving any test from a host, remove that host
        ;; from the whitelist. We don't want to send tasks to a host
        ;; that is (a) experiencing a network partition or (b) whose
        ;; tests are timing-out. If it is (c) that the satellite-slave
        ;; process is down, this at least warrants investigation.
        (where* expired?
                (fn [event]
                  (warn "Removing host due to expired event" (:host event))
                  (off-host (:host event)))
                ;; Otherwise make sure all tests pass on each host
                (else
                 (coalesce 60
                           ensure-all-tests-pass))))
 ;; if less than 70% of hosts registered with mesos are
 ;; on the whitelist, alert with an email
 ;; (where (and (service #"mesos/prop-available-hosts")
 ;;             (< metric 0.7))
 ;;        (email "[email protected]"))
 )
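To spell out the flow in that config: each test event carries a TTL (the custom check above uses 300 seconds), so when a slave stops reporting, Riemann's index eventually re-emits the event with state "expired", the (where* expired? ...) branch fires, and off-host drops that host, which is what the expired /tmp/mesos events did to jb8 above. A minimal sketch of that branching, with plain keywords standing in for the real off-host / ensure-all-tests-pass streams (illustrative only, not Satellite code):

;; Illustrative only: mirrors the expired-vs-ok branching in the config
;; above, using keywords in place of the real Satellite streams.
(defn route-slave-event
  [{:keys [host state] :as event}]
  (if (= state "expired")
    [:off-host host]                 ; host is dropped from the whitelist
    [:check-all-tests-pass host]))   ; otherwise gated on all tests passing

(route-slave-event {:host "jb8.example.com"
                    :state "expired"
                    :service "mesos/slave/percentage used of /tmp/mesos"})
;; => [:off-host "jb8.example.com"]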
Awesome, thanks!
I'm prepared to attempt to reproduce this issue. Has it happened again for you, @juicedM3? @sabraham, have there been any developments since your last comment here?
Nothing substantial to report, @mforsyth
@mforsyth Sadly, we've moved away from Satellite. Our client wanted to run a lighter-weight slave/agent (even though memory is cheap), and that ruled out many of Satellite's features. Plus, no one here really knows Clojure, so if we wanted to take it apart and glue it back together, maintaining it would be difficult for them.
@juicedM3 thanks for letting us know your reasons!
@mforsyth We're definitely going to keep an eye on the project. I'm hoping to improve my Clojure knowledge in the future, so there might be a point where things come back around. Thanks!