[WIP] Ensure workers and miq_worker rows match
It is possible for the workers and the miq_worker rows to get out of sync
https://github.com/ManageIQ/manageiq/issues/22644
The first 10 failures are from the awesome_spawn update.
I think this one is reasonable:
```
11) MiqServer::WorkerManagement::Kubernetes#sync_from_system #ensure_kube_monitors_started podified, ensures pod monitor started and orphaned rows are removed
    Failure/Error: expect(server.worker_manager).to receive(:cleanup_orphaned_worker_rows)

      (#<MiqServer::WorkerManagement::Kubernetes:0x0000564498f5f478 @my_server=#<MiqServer id: 88000000002459, guid: "4ae48e49-aef7-4e8c-9c53-361dc1a662a5", status: "started", started_on: "2023-11-02 15:58:19.285129921 +0000", stopped_on: nil, pid: nil, build: nil, percent_memory: nil, percent_cpu: nil, cpu_time: nil, name: "miq_server_0000000002315", last_heartbeat: "2023-11-02 15:58:19.285110421 +0000", os_priority: nil, is_master: false, logo: nil, version: "9.9.9.9", zone_id: 88000000004083, memory_usage: nil, memory_size: nil, hostname: nil, ipaddress: nil, drb_uri: nil, mac_address: nil, vm_id: nil, has_active_userinterface: nil, has_active_webservices: nil, sql_spid: nil, log_file_depot_id: nil, proportional_set_size: nil, has_active_remote_console: nil, system_memory_free: nil, system_memory_used: nil, system_swap_free: nil, system_swap_used: nil, has_active_cockpit_ws: nil, unique_set_size: nil, has_vix_disk_lib: nil>, @workers_lock=#<Sync:0x0000564498f5f388 @sync_mode=:UN, @sync_waiting=[], @sync_upgrade_waiting=[], @sync_sh_locker={}, @sync_ex_locker=nil, @sync_ex_count=0, @sync_mutex=#<Thread::Mutex:0x0000564498f5f1d0>>, @workers={}, @queue_messages_lock=#<Sync:0x0000564498f5f068 @sync_mode=:UN, @sync_waiting=[], @sync_upgrade_waiting=[], @sync_sh_locker={}, @sync_ex_locker=nil, @sync_ex_count=0, @sync_mutex=#<Thread::Mutex:0x0000564498f5eed8>>, @queue_messages={}>).cleanup_orphaned_worker_rows(*(any args))
          expected: 1 time with any arguments
          received: 0 times with any arguments
    # ./spec/models/miq_server/worker_management/kubernetes_spec.rb:125:in `block (4 levels) in <top (required)>'
```
@kbrock yeah, this isn't close to done yet; I expect the tests to fail right now.
Reran after merge of #22772
@agrare I don't remember the details why, but I recall that worker rows are created by the server in the spawn/systemd case, but in the pod itself when it invokes run_single_worker.rb. In other words, for a moment we have a row without a process in the spawn/systemd case and a pod without a row in the podified case. I don't know if this change, or the change I tested earlier this week, would cause a problem with this existing assumption.
In my case, I was attempting to reload the worker row before each heartbeat and let the process exit if the row was removed.
```diff
diff --git a/app/models/miq_worker/runner.rb b/app/models/miq_worker/runner.rb
index 6b7ed69f3d..c8dd3ffd9c 100644
--- a/app/models/miq_worker/runner.rb
+++ b/app/models/miq_worker/runner.rb
@@ -319,6 +319,7 @@ class MiqWorker::Runner
     # Heartbeats can be expensive, so do them only when needed
     return if @last_hb.kind_of?(Time) && (@last_hb + worker_settings[:heartbeat_freq]) >= now

+    reload_worker_record
     systemd_worker? ? @worker.sd_notify_watchdog : heartbeat_to_file
     if config_out_of_date?
```
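The idea behind that diff can be sketched as a small standalone simulation. The class names, the `RecordGone` error, and the return values below are stand-ins, not the actual `MiqWorker::Runner` or ActiveRecord API:

```ruby
# Stand-in for ActiveRecord::RecordNotFound.
class RecordGone < StandardError; end

# Stand-in for an ActiveRecord-backed worker row.
class WorkerRecord
  @@rows = {}

  def self.create(id)
    @@rows[id] = new(id)
  end

  def self.delete(id)
    @@rows.delete(id)
  end

  def initialize(id)
    @id = id
  end

  # Raise if the row has been deleted, like ActiveRecord#reload would.
  def reload
    raise RecordGone, "worker row #{@id} no longer exists" unless @@rows.key?(@id)
    self
  end
end

class Runner
  attr_reader :exited

  def initialize(worker)
    @worker = worker
    @exited = false
  end

  # Called at the top of each heartbeat: reload the row first, and shut
  # down cleanly if the server (or orphan cleanup) has deleted it.
  def heartbeat
    @worker.reload
    :heartbeated
  rescue RecordGone
    @exited = true
    :exited
  end
end
```

With this guard, a worker whose row is removed stops heartbeating on its next cycle instead of running forever without a matching row.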
@jrafanie that is a great point and something we need to be careful about. Kubernetes creates the worker rows in run_single_worker because we can't know the GUID prior to starting the pod (all pods in a replica set have to have the same environment). Kubernetes was already deleting all worker rows that didn't match any running pods.
For the systemd/process model I believe we're okay based on where in the monitor loop we are; by the time we are checking for orphan rows, the worker records will have been created and the process or systemd service will have been started.
For kubernetes there could be an issue with the reverse, where the pod is created but extremely slow and the worker record hasn't been created yet (NOTE: I haven't implemented this one yet). I think if we include the "starting timeout" in the check, based on the age of the pod, then we should be able to cover that case, wdyt?
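That pod-age grace period could be sketched roughly as below. The names, the row/pod shapes, and the timeout value are assumptions for illustration; the real logic would live in `MiqServer::WorkerManagement::Kubernetes`:

```ruby
# Stand-in for the worker_monitor starting timeout setting (seconds).
STARTING_TIMEOUT = 600

# pods: hash of pod name => creation Time
# rows: array of worker-row hashes whose :system_uid matches a pod name
#
# Returns pods that have no matching row AND are older than the starting
# timeout. Young pods get a grace period because run_single_worker.rb
# inside the pod may not have created their row yet.
def orphaned_pods(pods, rows, now: Time.now)
  row_uids = rows.map { |r| r[:system_uid] }
  pods.reject do |name, created_at|
    row_uids.include?(name) || (now - created_at) < STARTING_TIMEOUT
  end
end
```

A pod that is slow to start is therefore ignored until it has outlived the starting timeout, after which a missing row marks it as a genuine orphan.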
Yeah, that could work. We've seen pods be really slow to start in our 🔥 environments before, such as when CPU throttling is in play. Anything we do should account for possibly significant delays in both situations:
- Yeah, I think you're right about "creating" rows in systemd/spawn. We do allow for some additional time, using the starting timeout... https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_server/worker_management/monitor/settings.rb#L51 via https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_server/worker_management/monitor/validation.rb#L5.
- For podified, we have liveness probes per pod, but it's likely they run independently from the creation of the worker row. Killing the pod can be hard because it could end up in a reboot loop if it's continually getting throttled or delayed. We'll have to think about that.
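The same grace-period idea applies in the reverse direction for spawn/systemd: only treat a row with no matching process as orphaned once the row has outlived the starting timeout. This is a hypothetical sketch; the field names and timeout value are assumptions, not the actual schema:

```ruby
# Stand-in for the worker_monitor starting timeout setting (seconds).
STARTING_TIMEOUT = 600

# rows: array of worker-row hashes with :pid and :created_on (Time)
# running_pids: pids of the processes/systemd units we can actually see
#
# A row is only an orphan if its process is missing AND the row is older
# than the starting timeout, covering the window between row creation
# and the process/service actually starting.
def orphaned_worker_rows(rows, running_pids, now: Time.now)
  rows.select do |row|
    !running_pids.include?(row[:pid]) && (now - row[:created_on]) > STARTING_TIMEOUT
  end
end
```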
This pull request has been automatically marked as stale because it has not been updated for at least 3 months.
If these changes are still valid, please remove the stale label, make any changes requested by reviewers (if any), and ensure that this issue is being looked at by the assigned/reviewer(s).
app/models/miq_server/worker_management/kubernetes.rb
- [ ] :warning: - Line 58, Col 5 - Lint/EmptyBlock - Empty block detected.
- [ ] :exclamation: - Line 57, Col 7 - Style/CommentAnnotation - Annotation keywords like `TODO` should be all upper case, followed by a colon, and a space, then a note describing the problem.