[BUG] Git pillar lock files not being removed.
3006.4
Description
It's been observed that git pillar lock files are not being removed properly.
Dec 21 08:26:23 salt-master[33771]: [WARNING ] git_pillar_global_lock is enabled and update lockfile /var/cache/salt/master/git_pillar/nonprod/_/.git/update.lk is>
Dec 21 08:26:23 salt-master[33771]: [WARNING ] Update lock file is present for git_pillar remote 'nonprod git@***', >
Dec 21 08:35:34 salt-master[60671]: [WARNING ] git_pillar_global_lock is enabled and update lockfile /var/cache/salt/master/git_pillar/nonprod/_/.git/update.lk is>
Dec 21 08:35:34 salt-master[60671]: [WARNING ] Update lock file is present for git_pillar remote 'nonprod git@***', >
Dec 21 08:46:40 salt-master[60671]: [WARNING ] git_pillar_global_lock is enabled and update lockfile /var/cache/salt/master/git_pillar/nonprod/_/.git/update.lk is>
Dec 21 08:46:40 salt-master[60671]: [WARNING ] Update lock file is present for git_pillar remote 'nonprod git@***', >
However, running fileserver.update from the command line did not indicate there was a lock.
$ sudo salt-run fileserver.update
True
$ sudo salt-run cache.clear_git_lock git_pillar type=update
cleared:
Removed update lock for git_pillar remote 'nonprod git@***'
Setup (Please provide relevant configs and/or SLS files; be sure to remove sensitive info. There is no general set-up of Salt.)
Please be as specific as possible and give set-up details.
- [ ] on-prem machine
- [ ] VM (Virtualbox, KVM, etc. please specify)
- [ ] VM running on a cloud service, please be explicit and add details
- [ ] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [ ] classic packaging
- [ ] onedir packaging
- [ ] used bootstrap to install
Steps to Reproduce the behavior (Include debug logs if possible and relevant)
Expected behavior
Pillars should update without issue.
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
Salt: 3006.4
Python Version:
Python: 3.10.13 (main, Oct 4 2023, 21:54:22) [GCC 11.2.0]
Dependency Versions:
cffi: 1.14.6
cherrypy: 18.6.1
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.2
libgit2: 1.3.0
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: 1.7.0
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: 0.13.12
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
Salt Extensions:
SSEAPE: 8.14.1.5
System Versions:
dist: rhel 9.3 Plow
locale: utf-8
machine: x86_64
release: 5.14.0-362.8.1.el9_3.x86_64
system: Linux
version: Red Hat Enterprise Linux 9.3 Plow
Additional context
The salt-master's configuration file only contains log_level_logfile: debug; all GitFS settings are in RaaS, and this appears to be SSEAPI related. Will talk with the RaaS team to investigate further how GitFS is being used with regular open Salt and from RaaS.
Chatted with the RaaS team and they don't use GitFS. Found the GitFS settings in a separate file, masterXX_gitfs.conf, which leverages Azure.
Q: How is Azure accessed? Are they using the Azure Salt extension?
Have a new master log file with the lock errors. I am seeing a large number of traceback errors in the file due to Jinja2 rendering errors; this may be because the pillar data is not accessible from GitFS, or some other issue. Investigating whether the error paths on GitFS access are failing to release the lock files, which would explain why they are still present (there may be an error path that was overlooked in the implementation).
See probable cause, for example: process 50973 obtains a lock but never releases it, is killed by SIGTERM, and another process later attempts to obtain the lock and finds that process 50973 still holds it even though it is dead. Note the long gap between obtaining the lock and the SIGTERM; the lock should have been released soon after it was obtained. There are probably two bugs here: locks should not be held that long, and the SIGTERM path should check for and release resources such as lock files held by the process. A minimal sketch of the pattern appears after the log excerpts below.
Obtain lock
2024-01-11 09:20:26,218 [salt.utils.gitfs :999 ][DEBUG ][50973] Set update lock for gitfs remote '[email protected]:v3/Vizientinc/Hosting/SaltStack'
SIGTERM approx. 16 hours later
2024-01-12 01:15:41,521 [salt.utils.process:1070][DEBUG ][50973] FileserverUpdate received a SIGTERM. Exiting
Finding lock held by dead process
2024-01-12 01:16:25,647 [salt.utils.gitfs :961 ][WARNING ][3200] gitfs_global_lock is enabled and update lockfile /var/cache/salt/master/gitfs/work/NlJQs6Pss_07AugikCrmqfmqEFrfPbCDBqGLBiCd3oU=/_/update.lk is present for gitfs remote '[email protected]:v3/Vizientinc/Hosting/SaltStack'. Process 50973 obtained the lock but this process is not running. The update may have been interrupted. If using multi-master with shared gitfs cache, the lock may have been obtained by another master.
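For illustration, a minimal sketch of the pattern the log messages describe, assuming the lock file simply records the owning PID (the helper names below are made up for the example, not Salt's actual implementation): a lock is considered stale when the PID it names is no longer running.

import os

def read_lock_pid(lock_path):
    """Return the PID recorded in the lock file, or None if unreadable."""
    try:
        with open(lock_path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None

def pid_is_running(pid):
    """Check whether a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0: existence check, no signal is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True

def lock_is_stale(lock_path):
    """A lock is stale if it names a PID that is no longer running."""
    pid = read_lock_pid(lock_path)
    return pid is not None and not pid_is_running(pid)

# Example: decide whether an update.lk can be safely removed because its
# owner (e.g. 50973) was terminated without releasing it.
if __name__ == "__main__":
    path = "/var/cache/salt/master/gitfs/work/example/_/update.lk"
    if os.path.exists(path) and lock_is_stale(path):
        os.remove(path)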
The current testing for gitfs relies only on _master_lock, exercising it directly:
_master_lock = multiprocessing.Lock()
with acquire and release. This doesn't really test the locking in the GitProvider class with the PyGit2 and GitPython lock & clear_lock mechanism; it goes straight to the underlying acquire and release. The tests should be exercising the methods themselves, as sketched below.
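A test along these lines (a minimal pytest sketch, not taken from the Salt test suite) would assert on the lock file that the provider's lock/clear_lock methods create and remove on disk. FileLockProvider is a hypothetical stand-in for a configured GitProvider, since constructing a real one requires full opts and remote setup.

import os
import pytest

class FileLockProvider:
    """Stand-in for a GitProvider-style object: lock()/clear_lock() manage a
    PID-stamped update.lk on disk, which is what the tests should assert on."""

    def __init__(self, cache_dir):
        self.lock_file = os.path.join(cache_dir, "update.lk")

    def lock(self):
        with open(self.lock_file, "w") as fh:
            fh.write(str(os.getpid()))

    def clear_lock(self):
        if os.path.exists(self.lock_file):
            os.remove(self.lock_file)

@pytest.fixture
def provider(tmp_path):
    return FileLockProvider(str(tmp_path))

def test_lock_creates_and_clear_removes_lock_file(provider):
    # Taking the lock should leave an update.lk on disk with our PID in it.
    provider.lock()
    assert os.path.exists(provider.lock_file)
    assert open(provider.lock_file).read() == str(os.getpid())

    # Clearing the lock should remove the file, not just release an
    # in-memory multiprocessing.Lock.
    provider.clear_lock()
    assert not os.path.exists(provider.lock_file)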
Have changes and tests in PR https://github.com/saltstack/salt/pull/65937 for handling GitFS lock files, so that leftovers from a terminated process get cleaned up.
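Not the PR's actual implementation, but a minimal sketch of the general idea of cleaning up lock files when the process is terminated: the process remembers the lock files it owns and removes them from a SIGTERM handler and an atexit hook.

import atexit
import os
import signal
import sys

HELD_LOCKS = set()  # paths of lock files this process currently owns

def acquire_lock(lock_path):
    """Create a PID-stamped lock file and remember it for cleanup."""
    with open(lock_path, "w") as fh:
        fh.write(str(os.getpid()))
    HELD_LOCKS.add(lock_path)

def release_locks():
    """Remove every lock file this process still holds."""
    for path in list(HELD_LOCKS):
        try:
            os.remove(path)
        except OSError:
            pass
        HELD_LOCKS.discard(path)

def _on_sigterm(signum, frame):
    release_locks()
    sys.exit(0)

# Clean up both on normal interpreter exit and on SIGTERM, so a long-running
# update that gets terminated does not leave update.lk behind.
atexit.register(release_locks)
signal.signal(signal.SIGTERM, _on_sigterm)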
Closing since PR https://github.com/saltstack/salt/pull/65937 is merged.