Unnecessary locking while installing roles
ISSUE TYPE
- Bug Report
SUMMARY
A job template is linked to a project that is NOT marked for update on launch.
If the project contains the file roles/requirements.yml, the update playbook is forced to run anyway.
Commit f64d0dde5 introduced tags so that only the roles install runs, avoiding an update of the project itself (because it is marked not to update).
At this point, only the block to run ansible-galaxy will be executed, reading the file PROJECT_PATH/roles/requirements.yml and installing the roles in JOB_PRIVATE_DATA_DIR/requirements_roles.
The update job is scheduled.
The run() method calls the pre_run_hook() of the RunProjectUpdate task, which tries to acquire a lock for the project folder:
class RunProjectUpdate(BaseTask):
    ...
    def pre_run_hook(self, instance, private_data_dir):
        ...
        self.acquire_lock(instance)
This lock will be a file like /var/lib/awx/projects/_1__project_foo205016.lock, avoiding changes in the project dir while some project update job is running.
But in this case the job is not going to modify the project path; it only downloads roles into the tempdir, so I think the lock is unnecessary.
The file that ansible-galaxy reads should be the copy of requirements.yml in the tempdir, avoiding race conditions when the project is updated.
I am not sure if the copy of the project folder to the tempdir is done after or before the execution of the project_update.yml playbook.
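To make the serialization concrete, here is a minimal sketch of how a per-project lock file can force concurrent jobs to queue. It is not AWX's implementation: the helper names acquire_project_lock and release_project_lock, the polling interval, and the use of fcntl.flock are all assumptions chosen for illustration.

```python
import errno
import fcntl
import os
import time


def acquire_project_lock(lock_path, poll_interval=0.05, timeout=None):
    """Poll until an exclusive advisory lock on lock_path is obtained.

    Hypothetical helper, not AWX code. Returns (fd, seconds_waited).
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    start = time.monotonic()
    while True:
        try:
            # Non-blocking attempt: fails with EAGAIN while another
            # open file description holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd, time.monotonic() - start
        except OSError as exc:
            if exc.errno not in (errno.EAGAIN, errno.EACCES):
                os.close(fd)
                raise
            if timeout is not None and time.monotonic() - start >= timeout:
                os.close(fd)
                raise TimeoutError(f"could not lock {lock_path}")
            time.sleep(poll_interval)


def release_project_lock(fd):
    """Release the lock and close the descriptor; the file itself stays."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Every job that goes through such a helper for the same project waits for all earlier holders, even when it only needs to read the project directory.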
ENVIRONMENT
- AWX version: devel
STEPS TO REPRODUCE
Create project with a roles/requirements.yml file.
Mark the project as not to be updated on launch.
Create a template linked to that project.
Run several jobs of that template.
EXPECTED RESULTS
Jobs should run concurrently.
ACTUAL RESULTS
Jobs run sequentially because they are stuck trying to get the lock needed to run the project_update.yml playbook.
ADDITIONAL INFORMATION
And maybe the update project playbook could download the roles to the project dir.
Then, when that dir is copied to the tempdir, ansible-galaxy could perhaps skip the roles that already have the required version, and the task could run faster.
Thank you for your well-written arguments.
> I am not sure if the copy of the project folder to the tempdir is done after or before the execution of the project_update.yml playbook.
After. This is a complication. I want to voice my agreement that if the project update's job_tags do not include any update_* tag, then:
- the folder copy could happen before the project playbook run
- it could operate completely within the job's tmp dir
- it would have no need to obtain the lock.
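The condition above could be sketched roughly as follows. The function name needs_project_lock and the example tag names are hypothetical and only illustrate the "no update_* tag" rule; this is not existing AWX code.

```python
def needs_project_lock(job_tags):
    """Return True if any requested tag is an update_* tag.

    job_tags is a comma-separated string, as job tags are usually passed.
    Hypothetical helper for illustration only.
    """
    tags = [tag.strip() for tag in job_tags.split(",") if tag.strip()]
    return any(tag.startswith("update_") for tag in tags)


# Example tag names below are illustrative assumptions.
print(needs_project_lock("update_git,install_roles"))
print(needs_project_lock("install_roles,install_collections"))
```

A roles-only sync would then copy the project folder first, work entirely inside the job's tmp dir, and skip lock acquisition.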
> And maybe the update project playbook could download the roles to the project dir.
We want to do almost exactly that; issue https://github.com/ansible/awx/issues/5518 covers it. There are complications with using the project directory for a specific project, but some other subdirectory should work, specifics TBD. @fosterseth has been working on reducing the locking. The proposal here is a possibility, but may be less necessary after a better caching solution for roles is worked out.
@AlanCoding does your linked issue (#5518) cover this issue (i.e., should this be dup'd away)? If not, what actually needs to be done here that's not covered in the other issue?
No, that doesn't fully cover this. This suggestion is much more nuanced.
If we do #5518, the need for this will be greatly reduced. I'll adjust tags somewhat to reflect that. We can reduce local locking from project syncs, but its priority should be reconsidered after other improvements to roles/collections installs wrap up, and maybe scale testing will be relevant as well.
This has become a big problem for us when running multiple (~140) jobs from the same project.
Strangely, we see this behavior even for projects that do not include a requirements.yml of any sort.
Jobs that are supposed to run concurrently are effectively executed sequentially, with increasing delay, even though each job only needs to perform a read operation on the project dir.
Our setup:
- AWX 24.6.1 on OpenShift
- /var/lib/awx/projects dir is an NFS share mounted from a persistent volume claim with
  spec:
    accessModes:
      - ReadWriteMany
- Project "Update on launch" deactivated
- Job Template "Run concurrently" activated
- No ./requirements or ./(collections|roles)/requirements.yml
awx-task logs contain many messages like:
2025-04-28 18:01:49,817 INFO [c6689856788446c9999b725e581acbe1] awx.main.tasks.jobs exception acquiring lock /var/lib/awx/projects/_2329__project_name.lock: [Errno 11] Resource temporarily unavailable
2025-04-28 18:02:55,222 INFO [c6689856788446c9999b725e581acbe1] awx.main.tasks.jobs Job 1267079 waited 65.40522575378418 to acquire lock for local source tree for path /var/lib/awx/projects/_2329__project_name.lock.
...
2025-04-29 05:15:18,231 INFO [5cc7f9f1995541a89e898981d84c45f7] awx.main.tasks.jobs Job 1268281 waited 6195.004686594009 to acquire lock for local source tree for path /var/lib/awx/projects/_2329__project_name.lock.
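To quantify the delay from logs like these, a small parser can pull out the per-job wait times. This is only a sketch: the regular expression is derived from the two "waited ... to acquire lock" lines above, and the names WAIT_RE and lock_waits are made up for this example.

```python
import re

# Pattern inferred from the log lines quoted above (an assumption, not an
# official log format).
WAIT_RE = re.compile(
    r"Job (?P<job>\d+) waited (?P<secs>\d+(?:\.\d+)?) to acquire lock "
    r".*for path (?P<path>\S+?)\.?$"
)


def lock_waits(lines):
    """Return (job_id, seconds_waited, lock_path) for each matching line."""
    results = []
    for line in lines:
        match = WAIT_RE.search(line.strip())
        if match:
            results.append(
                (int(match["job"]), float(match["secs"]), match["path"])
            )
    return results
```

Sorting the result by seconds_waited shows which jobs queued longest behind the same lock path.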
I have just noticed that *.lock files are not being removed after jobs have finished. Is this by design?
$ ls -l /var/lib/awx/projects/
total 16
drwxr-xr-x. 3 awx root 4096 Apr 30 11:34 _522__awxtesting
-rwxr-xr-x. 1 awx root 0 Apr 30 11:42 _522__awxtesting.lock
drwxr-xr-x. 13 awx root 4096 Apr 24 07:11 _523__awx_tools
-rwxr-xr-x. 1 awx root 0 Apr 30 11:41 _523__awx_tools.lock
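A note on the leftover files: if the lock is an advisory lock held on an open file descriptor, which the "[Errno 11] Resource temporarily unavailable" message would be consistent with, then a zero-byte .lock file remaining on disk does not by itself mean the lock is still held. The sketch below illustrates this with fcntl.flock. The helper is_lock_held is hypothetical, and which locking primitive AWX actually uses is an assumption here; a real probe would have to use the same primitive, and for POSIX record locks it would have to run from a separate process.

```python
import fcntl
import os


def is_lock_held(lock_path):
    """Return True if another open file description holds a flock on the file.

    Hypothetical probe for illustration; assumes flock-style advisory locks.
    """
    fd = os.open(lock_path, os.O_RDWR)
    try:
        # Non-blocking attempt; raises BlockingIOError if the lock is taken.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Under that assumption, the lock is released when the holder unlocks or closes its descriptor, and the file can stay in place for the next job to reuse.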
Does anyone have an idea what else could cause the above behavior?