kubespray icon indicating copy to clipboard operation
kubespray copied to clipboard

speed up by removing unneeded dependency of kubespray-defaults on downloads

Open rptaylor opened this issue 2 years ago • 7 comments

What type of PR is this? /kind cleanup

What this PR does / why we need it: At one point in the past it may have been that this dependency of defaults on downloads was needed to set a fact, but as far as I can tell it should not be anymore. The dependency is invoked with skip_downloads: true, so none of the tasks in https://github.com/kubernetes-sigs/kubespray/blob/master/roles/download/tasks/main.yml will be executed - but they do all still get dynamically imported and processed so Ansible wastes a lot of time churning through each one of them.

Which issue(s) this PR fixes: Fixes https://github.com/kubernetes-sigs/kubespray/issues/9279

Special notes for your reviewer: See timing measurements posted below.

Does this PR introduce a user-facing change?: No

rptaylor avatar Sep 15 '22 21:09 rptaylor

Hi @rptaylor. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 15 '22 21:09 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rptaylor Once this PR has been reviewed and has the lgtm label, please assign miouge1 for approval by writing /assign @miouge1 in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Sep 15 '22 21:09 k8s-ci-robot

I used this test.yml play to evaluate the impact of the change. This is a sort of benchmark that represents 2 out of the 12 times that the cluster.yml play invokes the kubespray-defaults role.

- hosts: kube_control_plane
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray-defaults }

- hosts: k8s_cluster
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray-defaults }

On a small dev cluster with 6 nodes: at first test.yml took 17 s to run. After removing the dependency it took 4s to run.

On a large cluster with 150 nodes, at first test.yml took 1 minute to run. The output shows all the download and container-engine tasks that were being imported, not doing anything except wasting time:

Thursday 15 September 2022  20:37:06 +0000 (0:00:03.117)       0:00:57.951 **** 
=============================================================================== 
kubespray-defaults : Gather ansible_default_ipv4 from all hosts ------------------------------------------------------------------------------------------------------ 3.28s
kubespray-defaults : set fallback_ips -------------------------------------------------------------------------------------------------------------------------------- 3.26s
kubespray-defaults : Populates no_proxy to all hosts ----------------------------------------------------------------------------------------------------------------- 3.12s
kubespray-defaults : Configure defaults ------------------------------------------------------------------------------------------------------------------------------ 3.08s
prep_download | Set image pull/info command for docker on localhost -------------------------------------------------------------------------------------------------- 2.97s
prep_download | Parse the outputs of the previous commands ----------------------------------------------------------------------------------------------------------- 2.95s
prep_download | Set image pull/info command for containerd and crio -------------------------------------------------------------------------------------------------- 2.93s
prep_download | Set image pull/info command for docker --------------------------------------------------------------------------------------------------------------- 2.93s
container-engine/nerdctl : nerdctl | Download nerdctl ---------------------------------------------------------------------------------------------------------------- 2.93s
prep_download | Register docker images info -------------------------------------------------------------------------------------------------------------------------- 2.90s
prep_download | Create staging directory on remote node -------------------------------------------------------------------------------------------------------------- 2.88s
download | Get kubeadm binary and list of required images ------------------------------------------------------------------------------------------------------------ 2.85s
prep_download | Check that local user is in group or can become root ------------------------------------------------------------------------------------------------- 2.79s
container-engine/crictl : install crictĺ ----------------------------------------------------------------------------------------------------------------------------- 2.77s
download | Download files / images ----------------------------------------------------------------------------------------------------------------------------------- 2.76s
container-engine/nerdctl : nerdctl | Copy nerdctl binary from download dir ------------------------------------------------------------------------------------------- 2.76s
prep_download | Set a few facts -------------------------------------------------------------------------------------------------------------------------------------- 2.72s
prep_download | Set image pull/info command for containerd and crio on localhost ------------------------------------------------------------------------------------- 2.65s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts ------------------------------------------------------------------------------------------------------ 1.71s
kubespray-defaults : create fallback_ips_base ------------------------------------------------------------------------------------------------------------------------ 0.85s

After removing the dependency, it took only 15 seconds, and the output shows that only the kubespray-defaults tasks are running:

Thursday 15 September 2022  20:38:54 +0000 (0:00:02.801)       0:00:15.947 **** 
=============================================================================== 
kubespray-defaults : set fallback_ips -------------------------------------------------------------------------------------------------------------------------------- 3.65s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts ------------------------------------------------------------------------------------------------------ 3.22s
kubespray-defaults : Configure defaults ------------------------------------------------------------------------------------------------------------------------------ 2.80s
kubespray-defaults : Populates no_proxy to all hosts ----------------------------------------------------------------------------------------------------------------- 2.80s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts ------------------------------------------------------------------------------------------------------ 1.80s
kubespray-defaults : create fallback_ips_base ------------------------------------------------------------------------------------------------------------------------ 0.78s
kubespray-defaults : Configure defaults ------------------------------------------------------------------------------------------------------------------------------ 0.19s
kubespray-defaults : create fallback_ips_base ------------------------------------------------------------------------------------------------------------------------ 0.18s
kubespray-defaults : set fallback_ips -------------------------------------------------------------------------------------------------------------------------------- 0.18s
kubespray-defaults : Populates no_proxy to all hosts ----------------------------------------------------------------------------------------------------------------- 0.12s
kubespray-defaults : Set no_proxy to all assigned cluster IPs and hostnames ------------------------------------------------------------------------------------------ 0.10s
kubespray-defaults : Set no_proxy to all assigned cluster IPs and hostnames ------------------------------------------------------------------------------------------ 0.08s

So this should make defaults faster by a factor of 4 regardless of cluster size.

rptaylor avatar Sep 15 '22 21:09 rptaylor

@cristicalin or @oomichi could you please take a look?

rptaylor avatar Sep 15 '22 21:09 rptaylor

fatal: [ubuntu20]: FAILED! => {"msg": "The conditional check 'download_localhost' failed. The error was: error while evaluating conditional (download_localhost): 'download_localhost' is undefined\n\nThe error appears to be in '/builds/kargo-ci/kubernetes-sigs-kubespray/roles/kubernetes/preinstall/tasks/0020-verify-settings.yml': line 244, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Stop if download_localhost is enabled but download_run_once is not\n  ^ here\n"}

liupeng0518 avatar Sep 16 '22 02:09 liupeng0518

I guess a lot of task will fail, the issue is that download role ship a lot of version for component as variable..

floryut avatar Sep 16 '22 08:09 floryut

Interesting ... I don't understand how it could be doing anything with the equivalent of when: false on every task but I'll take a closer look.

Oh of course, the role tasks don't execute but the role defaults still get loaded .... need a way to do one without the other... I didn't realize the role was only being included for its default variables but there should be a better way to do that ...

rptaylor avatar Sep 16 '22 18:09 rptaylor

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 15 '22 23:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 14 '23 23:01 k8s-triage-robot

/remove-lifecycle rotten

rptaylor avatar Jan 16 '23 20:01 rptaylor

@rptaylor You may rebase master as if the CI is KO this PR won't be merged

floryut avatar Jan 16 '23 20:01 floryut

The download role is actually critical and will require refactoring to properly remove this dependency. In essence, download/defaults/main.yml contains all the versions definitions used in kubespray (>90% anyway) and it is imported throughout the code when we need to do comparisons or download stuff. We also call individual task files from this role in other roles to download files which means this role becomes a dependency of those roles ... in short this part of the code is extremely ugly and requires a lot of cleanup to speed it up properly.

Your goal is noble @rptaylor but I'm afraid the road to get there is much more complicated than just deleting a few lines of code.

cristicalin avatar Jan 17 '23 07:01 cristicalin

Yes, that is what I gathered in my previous comment. It is unusual to have a role that only sets variables. And the playbook is a large number of plays. In principle it should be possible for all the variable definitions to be moved to appropriate group vars with hopefully not too many if/else statements inside the variable definitions but it would be a big project.

rptaylor avatar Jan 17 '23 20:01 rptaylor