[BUG] After Upgrading, Windows Minion RAM Usage 7GB+
Description
After upgrading to 3005 from 3002.9 we are seeing heavy Salt minion memory usage across all our Windows servers. On this server with enough memory (96GB), total RAM usage of the Salt Minion is around 7GB. On more memory constrained Windows instances, the Salt Minion eats nearly all the available memory and starts paging. For example, on a 4GB server, Salt RAM usage is around 3GB.
Get-Process | ? ProcessName -eq python| select @{Name="RAMUsage";Expression={$_.WorkingSet64 / 1MB}} | measure -Sum RAMUsage
Count : 55
Average :
Sum : 7331.484375
Maximum :
Minimum :
Property : RAMUsage
On this server there are 55 salt-minion processes, each using about 130MB of ram each:
Get-Process | ? ProcessName -eq python| select name, id, starttime, @{Name="RAMUsage";Expression={$_
.WorkingSet64 / 1MB}}
Name Id StartTime RAMUsage
---- -- --------- --------
python 1008 9/11/2022 2:49:49 PM 69.3203125
python 1784 9/12/2022 1:58:43 AM 136.7578125
python 2084 9/11/2022 2:20:25 PM 147.26953125
python 2396 9/13/2022 2:54:43 PM 137.4765625
python 3344 9/11/2022 2:40:27 PM 69.15625
python 3908 9/11/2022 7:44:28 PM 136.02734375
python 3980 9/12/2022 11:14:46 AM 136.01171875
python 4632 9/11/2022 2:07:02 PM 147.6015625
python 4964 9/12/2022 5:21:47 PM 136.1171875
python 5052 9/11/2022 1:36:55 PM 136.96875
python 6884 9/13/2022 5:56:07 PM 137.84375
python 7472 9/11/2022 2:30:02 PM 69.234375
python 7824 9/11/2022 1:36:21 PM 96.453125
python 8440 9/11/2022 10:49:07 PM 136.05078125
python 8500 9/12/2022 4:59:36 AM 133.9453125
python 9096 9/11/2022 2:54:43 PM 68.984375
python 10164 9/11/2022 2:44:14 PM 69.44921875
python 11224 9/11/2022 2:56:26 PM 135.015625
python 11256 9/11/2022 4:37:44 PM 135.6875
python 11328 9/14/2022 3:23:16 PM 138.13671875
python 11424 9/14/2022 6:10:12 AM 137.94921875
python 11588 9/12/2022 11:36:54 PM 134.8203125
python 11924 9/14/2022 9:11:45 AM 136.58984375
python 12084 9/12/2022 2:15:28 PM 136.20703125
python 12652 9/14/2022 3:09:44 AM 138.05859375
python 13232 9/12/2022 8:08:16 AM 135.95703125
python 13348 9/12/2022 8:30:22 PM 135.6484375
python 13364 9/13/2022 8:59:03 PM 137.234375
python 13444 9/13/2022 2:40:25 AM 136.33203125
python 13860 9/13/2022 5:47:04 AM 137.9609375
python 14012 9/13/2022 8:47:31 AM 137.5078125
python 14044 9/14/2022 12:05:50 AM 137.703125
python 14560 9/13/2022 11:53:10 AM 137.48046875
python 14832 9/14/2022 12:14:59 PM 137.90234375
python 14964 9/15/2022 9:47:06 AM 139.140625
python 15376 9/15/2022 3:40:28 AM 138.2890625
python 15592 9/15/2022 6:57:40 PM 142.01171875
python 16932 9/14/2022 6:27:59 PM 138.32421875
python 17332 9/14/2022 9:28:52 PM 138.66796875
python 17432 9/15/2022 6:46:26 AM 138.46484375
python 17440 9/15/2022 12:53:42 PM 139.921875
python 17464 9/15/2022 12:31:12 AM 136.81640625
python 17700 9/16/2022 10:25:45 AM 148.3984375
python 18484 9/15/2022 10:01:28 PM 141.93359375
python 18524 9/15/2022 3:55:10 PM 141.26953125
python 19492 9/16/2022 7:35:29 PM 151.53125
python 19716 9/16/2022 7:17:40 AM 145.18359375
python 19808 9/16/2022 1:03:17 AM 143.23828125
python 19904 9/16/2022 1:30:48 PM 148.578125
python 19916 9/16/2022 4:11:25 AM 143.73046875
python 20848 9/17/2022 4:52:24 AM 155.0625
python 20952 9/16/2022 4:31:28 PM 149.3046875
python 21808 9/17/2022 1:44:41 AM 152.70703125
python 22220 9/16/2022 10:41:58 PM 151.81640625
python 22304 9/17/2022 7:56:50 AM 156.234375
Get-Date
Saturday, September 17, 2022 10:18:59 AM
Setup
Windows 2016 instances (both bare metal and virtualized via Hyper-V). Version 3005, via the MSI installer.
Steps to Reproduce the behavior
Nothing interesting in the logs.
Minion configuration is close to defaults:
schedule:
__mine_interval: {enabled: true, function: mine.update, jid_include: true, maxrunning: 2,
minutes: 60, return_job: false, run_on_start: true}
grains_cache: true
# Not documented.
# https://github.com/saltstack/salt/issues/48773
disable_grains:
- esxi
disable_modules:
- vsphere
# Just annoying:
- boto3_elasticsearch
- victorops
- ifttt
- pushbullet
- zfs
Expected behavior
RAM usage should be constrained.
Screenshots
These servers are Core instances, so not much of a UI. This is a screenshot from a Hypervisor (96GB of ram, 7GB used by Salt).

Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)C:\Program Files\Salt Project\Salt\bin\lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
Salt Version:
Salt: 3005
Dependency Versions:
cffi: 1.14.6
cherrypy: 18.6.1
dateutil: 2.8.1
docker-py: Not Installed
gitdb: 4.0.7
gitpython: Not Installed
Jinja2: 3.1.0
libgit2: Not Installed
M2Crypto: Not Installed
Mako: 1.1.4
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.10.1
pygit2: Not Installed
Python: 3.8.13 (tags/v3.8.13:ea67321, Aug 22 2022, 17:16:10) [MSC v.1929 64 bit (AMD64)]
python-gnupg: 0.4.8
PyYAML: 5.4.1
PyZMQ: 19.0.0
smmap: 4.0.0
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.2
System Versions:
dist:
locale: cp1252
machine: AMD64
release: 2016Server
system: Windows
version: 2016Server 10.0.14393 SP0 Multiprocessor Free
Additional context
All Windows instances appear to be impacted.
A little more context before I start restarting minions:
salt-call --local saltutil.running
local:
(Same server as above)
Did you upgrade through each major version, or jump straight to 3005?
If you install setproctitle you may get better process names to aid investigation.
This was a clean uninstall/install to move to the new directory structure (so complete removal of the C:\salt directory). A direct jump to 3005 from 3002.9. No cache should have carried over, there were new keys, etc. This also happened with a new Windows instance that was just just built out (never had salt installed).
The salt-minion installer broke when attempting to upgrade (spinned for days), so manual removal and reinstall (we also moved from the EXE to MSI, the later is much easier to debug).
How would one use setproctitle under Windows?
https://github.com/dvarrazzo/py-setproctitle#module-status Note that on Windows there is no way to change the process string: what the module does is to create a Named Object whose value can be read using a tool such as Process Explorer (contribution of a more useful tool to be used together with setproctitle would be well accepted).
How would one use
setproctitleunder Windows?
As the quote says, probably by using Process Explorer instead of the Windows Process Manager.
~So setproctitle is already being used by salt?~
Ah, found it, it creates a handle via a system mutex.
Schedule(name=Highstate, jid=20220920040540366404)
Schedule(name=Highstate, jid=20220920070650373582)
Looks like the processes used for scheduled runs keep running after completion. I don't see these jid's being reported back to the salt-master.
There was a change in schedule config at some point, though I don't remember specifically when. Might be worth a check:
Are you setting the schedule via a state or pillar?
Ensure job_args is a list and job_kwargs is a dict, and possibly purge the schedule and re-create.
We are setting schedule via pillar:
schedule:
Highstate:
function: state.highstate
minutes: 180
splay: 600
(you can tell how old this is by usage of state.highstate - this file hasn't been changed in 5 years, existing before we moved to gitfs! 😸)
I want to say this looks valid per https://docs.saltproject.io/salt/user-guide/en/latest/topics/scheduler.html
Highstate jobs are still executing at the expected 180 minute interval on these nodes (I can see the job return data on the salt-master). I'm not sure where the Schedule(name=Highstate, jid=20220920040540366404) jid is, there's no job return for that id. The job isn't in the master's cache (although, older highstate jobs are for this node, judging when a service.restart salt-minion job was executed last night).
Hi @Silvenga we are trying to reproduce your process leak and are having no luck. Can you help us by telling what states and modules your using? Do you know what your doing when salt minion is making the process that is hanging? For example are you running some kind of state file around the same time the hanging process spawns?
what states and modules your using
Does this help?
Highstate output for a Windows node (Hyper-V host)
----------
ID: Ensure enhanced telemetry is disabled
Function: reg.present
Name: HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\DataCollection
Result: True
Comment: AllowTelemetry in HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\DataCollection is already present
Started: 20:59:29.078658
Duration: 15.628 ms
Changes:
----------
ID: Ensure Remote management via rdp is enabled
Function: rdp.enabled
Result: True
Comment: RDP is enabled
Started: 20:59:29.094286
Duration: 631.372 ms
Changes:
----------
ID: Ensure hostname is correct
Function: system.hostname
Name: gh2
Result: True
Comment: Hostname is already set to 'gh2'
Started: 20:59:29.741304
Duration: 79.351 ms
Changes:
----------
ID: Ensure disk performance counters are enabled
Function: cmd.run
Name: diskperf -Y
Result: True
Comment: onlyif condition is false
Started: 20:59:29.820655
Duration: 8587.052 ms
Changes:
----------
ID: Ensure Server Manager performace counters are enabled at startup
Function: cmd.run
Name: Enable-ScheduledTask -TaskName 'Server Manager Performance Monitor' -TaskPath '\Microsoft\Windows\PLA\'
Result: True
Comment: unless condition is true
Started: 20:59:38.407707
Duration: 1934.249 ms
Changes:
----------
ID: Ensure Server Manager performace counters are running
Function: cmd.run
Name: logman start "Server Manager Performance Monitor"
Result: True
Comment: unless condition is true
Started: 20:59:40.341956
Duration: 829.823 ms
Changes:
----------
ID: Ensure the pcm Telegraf service grain is set
Function: grains.list_present
Name: telegraf
Result: True
Comment: Value pcm is already in grain telegraf
Started: 20:59:41.171779
Duration: 0.0 ms
Changes:
----------
ID: Ensure Timezone
Function: timezone.system
Name: America/Chicago
Result: True
Comment: Timezone America/Chicago already set
Started: 20:59:41.171779
Duration: 15.626 ms
Changes:
----------
ID: Ensure minion performance options are set
Function: file.managed
Name: C:\ProgramData\Salt Project\Salt\conf\minion.d\performance.conf
Result: True
Comment: File C:\ProgramData\Salt Project\Salt\conf\minion.d\performance.conf is in the correct state
Started: 20:59:41.187405
Duration: 219.879 ms
Changes:
----------
ID: Ensure the SaltStack minion is installed
Function: pkg.installed
Name: salt-minion-py3
Result: True
Comment: All specified packages are already installed and are at the desired version
Started: 20:59:41.425375
Duration: 6475.776 ms
Changes:
----------
ID: Ensure MP exclusion for [C:\Program Files\Salt Project\Salt] exists
Function: cmd.script
Name: salt://infra/salt/files/update-mp-exclusion.ps1
Result: True
Comment:
Started: 20:59:47.901151
Duration: 1600.777 ms
Changes:
----------
ID: Ensure smartmontools are installed
Function: pkg.installed
Name: smartmontools
Result: True
Comment: All specified packages are already installed and are at the desired version
Started: 20:59:49.517557
Duration: 174.049 ms
Changes:
----------
ID: Ensure smartd.conf is correct
Function: file.managed
Name: C:\Program Files\smartmontools\bin\smartd.conf
Result: True
Comment: File C:\Program Files\smartmontools\bin\smartd.conf is in the correct state
Started: 20:59:49.691606
Duration: 220.005 ms
Changes:
----------
ID: Ensure the smartd service is installed
Function: cmd.run
Name: & "C:\Program Files\smartmontools\bin\smartd.exe" install -c "C:\Program Files\smartmontools\bin\smartd.conf"
Result: True
Comment: No changes detected
Started: 20:59:50.285359
Duration: 0.0 ms
Changes:
----------
ID: Ensure smartd is enabled and is running
Function: service.running
Name: smartd
Result: True
Comment: The service smartd is already running
Started: 20:59:50.285359
Duration: 193.729 ms
Changes:
----------
ID: Ensure duplicati dependencies are installed
Function: pkg.installed
Name: vcredist-2015
Result: True
Comment: All specified packages are already installed
Started: 20:59:50.479088
Duration: 109.372 ms
Changes:
----------
ID: Ensure duplicati is installed
Function: pkg.installed
Name: duplicati
Result: True
Comment: All specified packages are already installed and are at the desired version
Started: 20:59:50.588460
Duration: 173.041 ms
Changes:
----------
ID: Ensure local db directory exists
Function: file.directory
Name: C:\Program Files\Duplicati 2\local-configs
Result: True
Comment: The directory C:\Program Files\Duplicati 2\local-configs is in the correct state
Started: 20:59:50.761501
Duration: 15.624 ms
Changes:
----------
ID: Ensure nightly backup script exists
Function: file.managed
Name: C:\Program Files\Duplicati 2\local-configs\backup-nightly.ps1
Result: True
Comment: File C:\Program Files\Duplicati 2\local-configs\backup-nightly.ps1 is in the correct state
Started: 20:59:50.777125
Duration: 214.42 ms
Changes:
----------
ID: Ensure .Net 2.0 and 3.5 are installed
Function: win_servermanager.installed
Name: Net-Framework-Core
Result: True
Comment: Installed the following:
- Net-Framework-Core
Started: 20:59:50.991545
Duration: 3634.698 ms
Changes:
----------
ID: Ensure Icinga2 is installed
Function: pkg.installed
Name: icinga2
Result: True
Comment: All specified packages are already installed and are at the desired version
Started: 20:59:54.626243
Duration: 253.537 ms
Changes:
----------
ID: Ensure Icinga2 Powershell modules is installed
Function: file.managed
Name: C:\Program Files\WindowsPowerShell\Modules\Icinga2Agent\Icinga2Agent.psm1
Result: True
Comment: File C:\Program Files\WindowsPowerShell\Modules\Icinga2Agent\Icinga2Agent.psm1 is in the correct state
Started: 20:59:54.881778
Duration: 96.994 ms
Changes:
----------
ID: Ensure check_windows_updates.ps1 is installed
Function: file.managed
Name: C:\Program Files\ICINGA2\sbin\check_windows_updates.ps1
Result: True
Comment: File C:\Program Files\ICINGA2\sbin\check_windows_updates.ps1 is in the correct state
Started: 20:59:54.978772
Duration: 42.074 ms
Changes:
----------
ID: Ensure check_pending_reboot.ps1 is installed
Function: file.managed
Name: C:\Program Files\ICINGA2\sbin\check_pending_reboot.ps1
Result: True
Comment: File C:\Program Files\ICINGA2\sbin\check_pending_reboot.ps1 is in the correct state
Started: 20:59:55.020846
Duration: 187.512 ms
Changes:
----------
ID: Remove conflicting icinga2 config file apt.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\apt.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\apt.conf is not present
Started: 20:59:55.208358
Duration: 15.769 ms
Changes:
----------
ID: Remove conflicting icinga2 config file downtimes.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\downtimes.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\downtimes.conf is not present
Started: 20:59:55.224127
Duration: 1.069 ms
Changes:
----------
ID: Remove conflicting icinga2 config file hosts.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\hosts.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\hosts.conf is not present
Started: 20:59:55.225196
Duration: 0.0 ms
Changes:
----------
ID: Remove conflicting icinga2 config file satellite.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\satellite.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\satellite.conf is not present
Started: 20:59:55.225196
Duration: 0.0 ms
Changes:
----------
ID: Remove conflicting icinga2 config file templates.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\templates.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\templates.conf is not present
Started: 20:59:55.225196
Duration: 0.0 ms
Changes:
----------
ID: Remove conflicting icinga2 config file users.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\users.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\users.conf is not present
Started: 20:59:55.225196
Duration: 0.0 ms
Changes:
----------
ID: Remove conflicting icinga2 config file app.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\app.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\app.conf is not present
Started: 20:59:55.225196
Duration: 0.0 ms
Changes:
----------
ID: Remove conflicting icinga2 config file commands.confgroups.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\commands.confgroups.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\commands.confgroups.conf is not present
Started: 20:59:55.225196
Duration: 0.0 ms
Changes:
----------
ID: Remove conflicting icinga2 config file notifications.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\notifications.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\notifications.conf is not present
Started: 20:59:55.225196
Duration: 0.0 ms
Changes:
----------
ID: Remove conflicting icinga2 config file services.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\services.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\services.conf is not present
Started: 20:59:55.225196
Duration: 15.637 ms
Changes:
----------
ID: Remove conflicting icinga2 config file timeperiods.conf
Function: file.absent
Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\timeperiods.conf
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\timeperiods.conf is not present
Started: 20:59:55.240833
Duration: 0.0 ms
Changes:
----------
ID: Complete Installation
Function: file.managed
Name: C:\ProgramData\icinga2\etc\icinga2\salt-managed
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\salt-managed is in the correct state
Started: 20:59:55.240833
Duration: 15.645 ms
Changes:
----------
ID: Setup Icinga2
Function: cmd.script
Result: True
Comment: State was not run because none of the onchanges reqs changed
Started: 20:59:55.256478
Duration: 0.0 ms
Changes:
----------
ID: Ensure Icinga2 CA is trusted
Function: win_pki.import_cert
Name: salt://infra/icinga2/files/ca.crt
Result: True
Comment: Certificate 'ED4B02846DA3ECDFCD1BC2D0D3353E8AACEAFB08' already contained in store: Cert:\LocalMachine\Root
Started: 20:59:55.256478
Duration: 3001.838 ms
Changes:
----------
ID: Icinga2 failed Installation
Function: file.managed
Name: C:\ProgramData\icinga2\etc\icinga2\salt-managed
Result: True
Comment: State was not run because onfail req did not change
Started: 20:59:58.258316
Duration: 0.0 ms
Changes:
----------
ID: Restart Icinga2
Function: service.running
Name: icinga2
Result: True
Comment: The service icinga2 is already running
Started: 20:59:58.273943
Duration: 31.445 ms
Changes:
----------
ID: Ensure host definition is correct
Function: file.serialize
Name: C:\ProgramData\icinga2\etc\icinga2\director-definition.json
Result: True
Comment: File C:\ProgramData\icinga2\etc\icinga2\director-definition.json is in the correct state
Started: 20:59:58.305388
Duration: 31.061 ms
Changes:
----------
ID: Ensure host has been declared in the Icinga2 director
Function: cmd.script
Name: salt://infra/icinga2/files/director-windows.ps1
Result: True
Comment:
Started: 20:59:58.352089
Duration: 1703.114 ms
Changes:
----------
ID: Ensure negative DNS resolutions are not cached
Function: cmd.run
Name: C:\Windows\System32\reg.exe add HKLM\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters /v MaxNegativeCacheTtl /t REG_DWORD /d 0 /f
Result: True
Comment: unless condition is true
Started: 21:00:00.055203
Duration: 107.085 ms
Changes:
----------
ID: Ensure DNS Client is restarted
Function: module.wait
Name: service.restart
Result: True
Comment: State was not run because none of the onchanges reqs changed
Started: 21:00:00.162288
Duration: 0.0 ms
Changes:
----------
ID: Ensure Salt Managed Config Exists
Function: file.managed
Name: C:\Program Files\Consul\Config\salt.json
Result: True
Comment: File C:\Program Files\Consul\Config\salt.json is in the correct state
Started: 21:00:00.162288
Duration: 46.259 ms
Changes:
----------
ID: Ensure Consul is installed
Function: pkg.installed
Name: consul
Result: True
Comment: All specified packages are already installed and are at the desired version
Started: 21:00:00.208547
Duration: 245.748 ms
Changes:
----------
ID: Ensure Telegraf is extracted
Function: archive.extracted
Name: C:\Program Files\Telegraf\1.15.2
Result: True
Comment: Path C:\Program Files\Telegraf\1.15.2\telegraf-1.15.2 exists
Started: 21:00:00.454295
Duration: 0.0 ms
Changes:
----------
ID: Ensure current Telegraf version exists
Function: file.symlink
Name: C:\Program Files\Telegraf\current
Result: True
Comment: Set ownership of symlink C:\Program Files\Telegraf\current to mlopez
Started: 21:00:00.454295
Duration: 15.619 ms
Changes:
----------
ownership:
mlopez
----------
ID: Telegraf config
Function: file.managed
Name: C:\Program Files\Telegraf\telegraf.conf
Result: True
Comment: File C:\Program Files\Telegraf\telegraf.conf is in the correct state
Started: 21:00:00.469914
Duration: 282.547 ms
Changes:
----------
ID: Telegraf servcies config
Function: file.directory
Name: C:\Program Files\Telegraf\telegraf.d
Result: True
Comment: The directory C:\Program Files\Telegraf\telegraf.d is in the correct state
Started: 21:00:00.752461
Duration: 15.624 ms
Changes:
----------
ID: Telegraf Windows config
Function: file.managed
Name: C:\Program Files\Telegraf\telegraf.d\inputs.conf
Result: True
Comment: File C:\Program Files\Telegraf\telegraf.d\inputs.conf is in the correct state
Started: 21:00:00.768085
Duration: 246.454 ms
Changes:
----------
ID: Ensure incorrectly configured telegraf services are removed
Function: cmd.run
Name: & "C:\Program Files\Telegraf\current\telegraf.exe" -service stop ; & "C:\Program Files\Telegraf\current\telegraf.exe" -service uninstall
Result: True
Comment: onlyif condition is true
unless condition is true
Started: 21:00:01.019533
Duration: 1100.691 ms
Changes:
----------
ID: Ensure the Telegraf service is installed
Function: cmd.run
Name: & "C:\Program Files\Telegraf\current\telegraf.exe" -service install --config "C:\Program Files\Telegraf\telegraf.conf" --config-directory "C:\Program Files\Telegraf\telegraf.d"
Result: True
Comment: unless condition is true
Started: 21:00:02.120224
Duration: 474.633 ms
Changes:
----------
ID: Ensure the Telegraf service is enabled and running
Function: service.running
Name: telegraf
Result: True
Comment: The service telegraf is already running
Started: 21:00:02.594857
Duration: 32.377 ms
Changes:
----------
ID: Ensure vcredist-2015 is installed
Function: pkg.installed
Name: vcredist-2015
Result: True
Comment: All specified packages are already installed
Started: 21:00:02.638893
Duration: 93.765 ms
Changes:
----------
ID: Ensure PCM is installed
Function: pkg.installed
Name: pcm
Result: True
Comment: All specified packages are already installed and are at the desired version
Started: 21:00:02.732658
Duration: 236.392 ms
Changes:
----------
ID: Ensure Telegraf for pcm is configured
Function: file.managed
Name: C:\Program Files\Telegraf\telegraf.d\salt-pcm.conf
Result: True
Comment: File C:\Program Files\Telegraf\telegraf.d\salt-pcm.conf is in the correct state
Started: 21:00:02.970052
Duration: 20.011 ms
Changes:
----------
ID: Ensure Telegraf for pcm running
Function: service.running
Name: telegraf
Result: True
Comment: The service telegraf is already running
Started: 21:00:02.991114
Duration: 31.003 ms
Changes:
----------
ID: Ensure Telegraf for hyper-v is configured
Function: file.managed
Name: C:\Program Files\Telegraf\telegraf.d\salt-hyper-v.conf
Result: True
Comment: File C:\Program Files\Telegraf\telegraf.d\salt-hyper-v.conf is in the correct state
Started: 21:00:03.022117
Duration: 31.266 ms
Changes:
----------
ID: Ensure Telegraf for hyper-v running
Function: service.running
Name: telegraf
Result: True
Comment: The service telegraf is already running
Started: 21:00:03.053383
Duration: 31.245 ms
Changes:
----------
ID: Ensure HyperV features are installed
Function: win_servermanager.installed
Result: True
Comment: The following features are already installed:
- Hyper-V
- RSAT-Hyper-V-Tools
- RSAT-AD-PowerShell
Started: 21:00:03.084628
Duration: 1146.602 ms
Changes:
----------
ID: Ensure migration is enabled
Function: cmd.run
Name: Enable-VMMigration
Result: True
Comment: unless condition is true
Started: 21:00:04.231230
Duration: 1110.606 ms
Changes:
----------
ID: Ensure migration auth is kerberos
Function: cmd.run
Name: Set-VMHost -VirtualMachineMigrationAuthenticationType Kerberos
Result: True
Comment: unless condition is true
Started: 21:00:05.357467
Duration: 852.183 ms
Changes:
----------
ID: Ensure migration performance is compression
Function: cmd.run
Name: Set-VMHost -VirtualMachineMigrationPerformanceOption Compression
Result: True
Comment: unless condition is true
Started: 21:00:06.209650
Duration: 913.1 ms
Changes:
----------
ID: Ensure enchanced session mode is enabled
Function: cmd.run
Name: Set-VMHost -EnableEnhancedSessionMode $True
Result: True
Comment: unless condition is true
Started: 21:00:07.122750
Duration: 997.013 ms
Changes:
----------
ID: Ensure schedule type is core
Function: cmd.run
Name: bcdedit /set HypervisorSchedulerType core
Result: True
Comment: unless condition is true
Started: 21:00:08.119763
Duration: 816.798 ms
Changes:
----------
ID: Ensure the Telegraf grain is set
Function: grains.list_present
Name: telegraf
Result: True
Comment: Value hyper-v is already in grain telegraf
Started: 21:00:08.936561
Duration: 0.0 ms
Changes:
----------
ID: Ensure host allows credential delegration
Function: cmd.script
Name: salt://stack/hyper-v/files/ensure-delegration.ps1
Result: True
Comment:
Started: 21:00:08.936561
Duration: 2799.39 ms
Changes:
----------
ID: Ensure machine is domain joined
Function: system.join_domain
Name: <blah>
Result: True
Comment: Computer already added to '<blah>'
Started: 21:00:11.735951
Duration: 110.529 ms
Changes:
Summary for local
-------------
Succeeded: 69 (changed=1)
Failed: 0
-------------
Total states run: 69
Total run time: 42.277 s
Do you know what your doing when salt minion is making the process that is hanging?
I don't think any processes are hanging, at least there's no impact to any operations. The process details suggest it's a scheduled highstate.
For example are you running some kind of state file around the same time the hanging process spawns?
It appears this occurs just with Highstate, no other execution are normally occurring on these servers.
Aside, I am hoping these servers will be upgraded to Windows Server 2022 soon'ish. There's already a migration to upgrade to Ubuntu 22.04 - while refactoring the existing Salt states (migrating to onedir and removing some legacy workarounds for random Salt problems over the may years).
@Silvenga thanks for the ^ info. Can I also see your minion config? Are you sure you did not make any changes to your states? Salt is eating up some much ram because it has a lot of sub processes for some reason. My guess is your cmd states are not exiting right.
The combined minion configuration for a Hyper-V Windows node. Nothing super exotic, the disable_modules are from some performance testing done years ago, so they might not be useful anymore. The restart-minion was from the workaround above that's only applied to Windows nodes. I don't remember why grains_cache was enabled.
This is mostly the standard configuration deployed to both Linux and Windows nodes.
master: <blah>
id: <blah>
schedule:
__mine_interval: {enabled: true, function: mine.update, jid_include: true, maxrunning: 2,
minutes: 60, return_job: false, run_on_start: true}
restart-minion:
args: [salt-minion]
cron: 28 6 * * *
enabled: true
function: service.restart
jid_include: true
maxrunning: 1
name: restart-minion
grains_cache: true
# Not documented.
# https://github.com/saltstack/salt/issues/48773
disable_grains:
- esxi
disable_modules:
- vsphere
# Just annoying:
- boto3_elasticsearch
- victorops
- ifttt
- pushbullet
- zfs
My guess is your cmd states are not exiting right.
Oh, that's interesting. I'll see if I can get a process tree before the minion is restarted. I should note that there are no processes hanging around after the service restart (e.g. process count is mostly constant on these nodes as measured by a Telegraf agent). Icinga also monitors the number of processes, and hasn't detected anything unconstrained.
I'm sorry to say that some of these Hyper-V nodes aren't exactly rebooted on a monthly cadence like we should - mostly because we don't have N+1 resources in some of the Hyper-V clusters to facilitate live migrations during rolling reboots. So, what I mean to say, some of these nodes have quite an uptime, suggesting the process count is constrained.
EDIT: 66 days of uptime is the current record across the clusters.
Thanks for the fast response, I'll keep looking if I can reproduce.
Thanks for the fast response, I'll keep looking if I can reproduce.
TBH, if you can't repo, I think the onus should be on the reporter (me), we should upgrade to 3006 just to confirm this hasn't been accidently fixed or something. From there, I'm hoping that this stuff get's moved over to the new Salt cluster with re-written states, which includes migrating to Windows Server 2022.
@Silvenga if would be great if you could find that state causing this problem. I would appreciate the help.
Closing Issues due to inactivity. Feel free to re-open if deemed necessary.