[WIP] Ensure workers and miq_worker rows match
It is possible for the workers and the miq_worker rows to get out of sync
https://github.com/ManageIQ/manageiq/issues/22644
The first 10 failures are from the awesome_spawn update.
I think this one is reasonable:
```
11) MiqServer::WorkerManagement::Kubernetes#sync_from_system #ensure_kube_monitors_started podified, ensures pod monitor started and orphaned rows are removed
    Failure/Error: expect(server.worker_manager).to receive(:cleanup_orphaned_worker_rows)

      (#<MiqServer::WorkerManagement::Kubernetes:0x0000564498f5f478 @my_server=#<MiqServer id: 88000000002459, guid: "4ae48e49-aef7-4e8c-9c53-361dc1a662a5", status: "started", started_on: "2023-11-02 15:58:19.285129921 +0000", stopped_on: nil, pid: nil, build: nil, percent_memory: nil, percent_cpu: nil, cpu_time: nil, name: "miq_server_0000000002315", last_heartbeat: "2023-11-02 15:58:19.285110421 +0000", os_priority: nil, is_master: false, logo: nil, version: "9.9.9.9", zone_id: 88000000004083, memory_usage: nil, memory_size: nil, hostname: nil, ipaddress: nil, drb_uri: nil, mac_address: nil, vm_id: nil, has_active_userinterface: nil, has_active_webservices: nil, sql_spid: nil, log_file_depot_id: nil, proportional_set_size: nil, has_active_remote_console: nil, system_memory_free: nil, system_memory_used: nil, system_swap_free: nil, system_swap_used: nil, has_active_cockpit_ws: nil, unique_set_size: nil, has_vix_disk_lib: nil>, @workers_lock=#<Sync:0x0000564498f5f388 @sync_mode=:UN, @sync_waiting=[], @sync_upgrade_waiting=[], @sync_sh_locker={}, @sync_ex_locker=nil, @sync_ex_count=0, @sync_mutex=#<Thread::Mutex:0x0000564498f5f1d0>>, @workers={}, @queue_messages_lock=#<Sync:0x0000564498f5f068 @sync_mode=:UN, @sync_waiting=[], @sync_upgrade_waiting=[], @sync_sh_locker={}, @sync_ex_locker=nil, @sync_ex_count=0, @sync_mutex=#<Thread::Mutex:0x0000564498f5eed8>>, @queue_messages={}>).cleanup_orphaned_worker_rows(*(any args))
          expected: 1 time with any arguments
          received: 0 times with any arguments
    # ./spec/models/miq_server/worker_management/kubernetes_spec.rb:125:in `block (4 levels) in <top (required)>'
```
@kbrock yeah, this isn't close to done yet; I expect the tests to fail right now.
Reran after merge of #22772
@agrare I don't remember the details why, but I recall that worker rows are created by the server in the spawn/systemd case, but in the pod itself when it invokes run_single_worker.rb. In other words, for a moment we have a row without a process in the spawn/systemd case and a pod without a row in the podified case. I don't know if this change, or the change I tested earlier this week, would cause a problem with this existing assumption.
In my case, I was attempting to reload the worker row before each heartbeat and let the process exit if the row was removed.
```diff
diff --git a/app/models/miq_worker/runner.rb b/app/models/miq_worker/runner.rb
index 6b7ed69f3d..c8dd3ffd9c 100644
--- a/app/models/miq_worker/runner.rb
+++ b/app/models/miq_worker/runner.rb
@@ -319,6 +319,7 @@ class MiqWorker::Runner
     # Heartbeats can be expensive, so do them only when needed
     return if @last_hb.kind_of?(Time) && (@last_hb + worker_settings[:heartbeat_freq]) >= now

+    reload_worker_record
     systemd_worker? ? @worker.sd_notify_watchdog : heartbeat_to_file
     if config_out_of_date?
```
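The idea behind that diff can be sketched as a small standalone simulation. The class names, the `RecordGone` error, and the return values below are stand-ins, not the actual `MiqWorker::Runner` or ActiveRecord API:

```ruby
# Stand-in for ActiveRecord::RecordNotFound.
class RecordGone < StandardError; end

# Stand-in for an ActiveRecord-backed worker row.
class WorkerRecord
  @@rows = {}

  def self.create(id)
    @@rows[id] = new(id)
  end

  def self.delete(id)
    @@rows.delete(id)
  end

  def initialize(id)
    @id = id
  end

  # Raise if the row has been deleted, like ActiveRecord#reload would.
  def reload
    raise RecordGone, "worker row #{@id} no longer exists" unless @@rows.key?(@id)
    self
  end
end

class Runner
  attr_reader :exited

  def initialize(worker)
    @worker = worker
    @exited = false
  end

  # Called at the top of each heartbeat: reload the row first, and shut
  # down cleanly if the server (or orphan cleanup) has deleted it.
  def heartbeat
    @worker.reload
    :heartbeated
  rescue RecordGone
    @exited = true
    :exited
  end
end
```

With this guard, a worker whose row is removed stops heartbeating on its next cycle instead of running forever without a matching row.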
@jrafanie that is a great point and something we need to be careful about. Kubernetes creates the worker rows in run_single_worker because we can't know the GUID prior to starting the pod (all pods in a replica set have to have the same environment). Kubernetes was already deleting all worker rows that didn't match any running pods.
For the systemd/process model I believe we're okay based on where in the monitor loop we are; by the time we are checking for orphan rows, the worker records will have been created and the process or systemd service will have been started.
For kubernetes there could be an issue with the reverse, where the pod is created but extremely slow and the worker record hasn't been created yet (NOTE: I haven't implemented this one yet). I think if we include the "starting timeout" in the check, based on the age of the pod, then we should be able to cover that case, wdyt?
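That pod-age grace period could be sketched roughly as below. The names, the row/pod shapes, and the timeout value are assumptions for illustration; the real logic would live in `MiqServer::WorkerManagement::Kubernetes`:

```ruby
# Stand-in for the worker_monitor starting timeout setting (seconds).
STARTING_TIMEOUT = 600

# pods: hash of pod name => creation Time
# rows: array of worker-row hashes whose :system_uid matches a pod name
#
# Returns pods that have no matching row AND are older than the starting
# timeout. Young pods get a grace period because run_single_worker.rb
# inside the pod may not have created their row yet.
def orphaned_pods(pods, rows, now: Time.now)
  row_uids = rows.map { |r| r[:system_uid] }
  pods.reject do |name, created_at|
    row_uids.include?(name) || (now - created_at) < STARTING_TIMEOUT
  end
end
```

A pod that is slow to start is therefore ignored until it has outlived the starting timeout, after which a missing row marks it as a genuine orphan.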
Yeah, that could work. We've seen pods be really slow to start in our 🔥 environments before, such as when CPU throttling is in play. Anything we do should account for possibly significant delays in both situations:
- Yeah, I think you're right about "creating" rows in systemd/spawn. We do allow for some additional time, using the starting timeout... https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_server/worker_management/monitor/settings.rb#L51 via https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_server/worker_management/monitor/validation.rb#L5.
- For podified, we have liveness probes per pod, but it's likely they run independently from the creation of the worker row. Killing the pod can be hard because it could end up in a reboot loop if it's continually getting throttled or delayed. We'll have to think about that.
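The same grace-period idea applies in the reverse direction for spawn/systemd: only treat a row with no matching process as orphaned once the row has outlived the starting timeout. This is a hypothetical sketch; the field names and timeout value are assumptions, not the actual schema:

```ruby
# Stand-in for the worker_monitor starting timeout setting (seconds).
STARTING_TIMEOUT = 600

# rows: array of worker-row hashes with :pid and :created_on (Time)
# running_pids: pids of the processes/systemd units we can actually see
#
# A row is only an orphan if its process is missing AND the row is older
# than the starting timeout, covering the window between row creation
# and the process/service actually starting.
def orphaned_worker_rows(rows, running_pids, now: Time.now)
  rows.select do |row|
    !running_pids.include?(row[:pid]) && (now - row[:created_on]) > STARTING_TIMEOUT
  end
end
```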
This pull request has been automatically marked as stale because it has not been updated for at least 3 months.
If these changes are still valid, please remove the stale label, make any changes requested by reviewers (if any), and ensure that this issue is being looked at by the assigned/reviewer(s).
app/models/miq_server/worker_management/kubernetes.rb
- [ ] :warning: - Line 58, Col 5 - Lint/EmptyBlock - Empty block detected.
- [ ] :exclamation: - Line 57, Col 7 - Style/CommentAnnotation - Annotation keywords like `TODO` should be all upper case, followed by a colon, and a space, then a note describing the problem.