
Initial commit Network-Aware framework

Open jpedro1992 opened this issue 3 years ago • 17 comments

What type of PR is this?

This is the initial PR for the network-aware framework approved in this KEP.

The PR includes two custom resources (AppGroup and NetworkTopology), three plugins (NetworkAware), and one controller (AppGroup).

/kind feature

What this PR does / why we need it:

This proposal introduces an end-to-end solution to model/weight a cluster's network latency and topological information, and leverage that to better schedule latency- and bandwidth-sensitive workloads. Further information is available in our merged KEP.
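
For a flavor of the proposed API, the sketch below shows a minimal AppGroup resource declaring a dependency between two workloads. The group/version and field names are illustrative, drawn loosely from the KEP, and may differ from the code in this PR:

```yaml
# Illustrative AppGroup sketch; group/version and field names are
# placeholders loosely based on the KEP, not the authoritative schema.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: AppGroup
metadata:
  name: online-boutique
spec:
  numMembers: 2
  topologySortingAlgorithm: KahnSort   # sort used by the AppGroup controller
  workloads:
    - workload:
        kind: Deployment
        apiVersion: apps/v1
        namespace: default
        name: p1
        selector: p1
      dependencies:
        - workload:
            kind: Deployment
            apiVersion: apps/v1
            namespace: default
            name: p2
            selector: p2
          minBandwidth: 100Mi     # minimum bandwidth required towards p2
          maxNetworkCost: 30      # maximum tolerated network cost to p2
```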

Special notes for your reviewer:

After several iterations on the KEP PR, we have updated our design, and we are submitting the first version of the proposed network-aware framework. All feedback is welcome. Thanks.

Cc: @Huang-Wei

jpedro1992 avatar Feb 28 '22 15:02 jpedro1992

Hi @jpedro1992. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 28 '22 15:02 k8s-ci-robot

cc: @ravisantoshgudimetla @ingvagabund and @damemi

wangchen615 avatar Feb 28 '22 16:02 wangchen615

/ok-to-test

denkensk avatar Mar 03 '22 03:03 denkensk

/retest

jpedro1992 avatar Mar 03 '22 09:03 jpedro1992

The integration tests fail because of the "api-approved.kubernetes.io" annotation in the two added CRDs (AppGroup & NetworkTopology). I was able to run the integration tests successfully locally after manually editing "api-approved.kubernetes.io". So, how should I proceed with this issue for this PR, since both CRDs are not approved yet? What is the proper way to add new CRDs and to resolve the issue raised by "api-approved.kubernetes.io"? Thanks.

CRD error:

[screenshot: CRD error]

After manually editing "api-approved.kubernetes.io":

[screenshot: integration tests passing]

jpedro1992 avatar Mar 03 '22 09:03 jpedro1992

So, how should I proceed with this issue for this PR, since both CRDs are not approved yet? What is the proper way to add new CRDs and to resolve the issue raised by "api-approved.kubernetes.io"? Thanks.

@jpedro1992 you can just check in changes to include the "api-approved.kubernetes.io" fields, just like other CRDs do. Don't forget to append # edited manually at the end.
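
For illustration, the annotation goes under metadata.annotations of the CRD manifest. A minimal sketch, with placeholder group/name (the annotation is required for any CRD whose group ends in k8s.io or kubernetes.io):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: appgroups.scheduling.sigs.k8s.io   # placeholder group/name
  annotations:
    # Points at the PR where the API is under review; the trailing
    # "# edited manually" marks the generated manifest as hand-edited.
    api-approved.kubernetes.io: https://github.com/kubernetes-sigs/scheduler-plugins/pull/348 # edited manually
spec:
  group: scheduling.sigs.k8s.io
  # names, scope, and versions elided
```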

Huang-Wei avatar Mar 10 '22 00:03 Huang-Wei

So, how should I proceed with this issue for this PR, since both CRDs are not approved yet? What is the proper way to add new CRDs and to resolve the issue raised by "api-approved.kubernetes.io"? Thanks.

@jpedro1992 you can just check in changes to include the "api-approved.kubernetes.io" fields, just like other CRDs do. Don't forget to append # edited manually at the end.

@Huang-Wei Thank you for getting back to me.

Currently, I have the following "api-approved.kubernetes.io" field in the AppGroup and NetworkTopology CRDs:

api-approved.kubernetes.io: https://github.com/kubernetes-sigs/scheduler-plugins/pull/348 # edited manually

I ran hack/verify-crdgen.sh and it still fails. However, if I manually edit the CRDs in the manifests/crds folder with the "api-approved.kubernetes.io" fields, the integration tests pass:

[screenshot: integration tests passing]

Should I amend my commit with this version, or is there any other way to solve this issue?

jpedro1992 avatar Mar 10 '22 09:03 jpedro1992

@jpedro1992 I got some suggestions from other reviewers to split the big PR into several commits, e.g. CRD/data type definitions + clientsets, sorting/filtering/scoring implementation, examples, etc.

wangchen615 avatar Mar 10 '22 16:03 wangchen615

Should I amend my commit with this version

yes, please do.

@wangchen615 's suggestion is valid. It'd be very helpful for reviewing if you can split this PR into 2 smaller ones:

  • APIs, including versioned/unversioned type definitions, validation, defaulting, conversion and tests (if needed)
  • Plugin implementations and tests (UT and integration)

Huang-Wei avatar Mar 11 '22 00:03 Huang-Wei

Hi @Huang-Wei,

I believe I solved the issues with the integration tests. Also, I have reduced the size of the PR:

  1. CRDs + API definitions

Perhaps it is better to wait for the review of this commit before pushing the plugin implementations + tests.

Thanks

jpedro1992 avatar Mar 11 '22 11:03 jpedro1992

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 13 '22 09:06 k8s-triage-robot

/remove-lifecycle stale

Please let me know if I can provide further help/info on the PR revision.

jpedro1992 avatar Jun 14 '22 10:06 jpedro1992

It'd be very helpful for reviewing if you can split this PR into 2 smaller ones:

  • APIs, including versioned/unversioned type definitions, validation, defaulting, conversion and tests (if needed)
  • Plugin implementations and tests (UT and integration)

Both points are not relevant until a plugin is introduced, given this PR currently introduces only CRDs and a controller; the CRD types under https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/apis/scheduling already have their own validation/defaulting.

@Huang-Wei is there any policy against introducing new CRDs and their corresponding controllers ahead of plugins? The AppGroup type and its controller will probably have no use besides recomputing the topological sorting. On the other hand, is the idea of this repository to always provide fully functioning plugins, or rather to also allow the community to continuously accept controllers that provide a selected/pre-computed view of the world? Consumers of the repository could then build their scheduling framework as a combination of the already existing plugins/controllers plus other custom plugins/controllers not yet merged into the repository.

ingvagabund avatar Aug 11 '22 13:08 ingvagabund

Both points are not relevant until a plugin is introduced, given this PR currently introduces only CRDs and a controller; the CRD types under master/apis/scheduling already have their own validation/defaulting.

IIRC, that comment was made when the PR was combining all things together.

is there any policy against introducing new CRDs and their corresponding controllers ahead of plugins?

To make this repo sustainable (given limited reviewer cycles), we have to use our bandwidth efficiently and focus on the scheduling bits. So, I'm a bit inclined to have AppGroup and its controller reside temporarily in their own repo, and be moved back (if necessary) once they have had one or two release cycles to soak. Even if it's hosted outside, we can still compile it into our controller image (we just need to be cautious to avoid a cyclic dependency). This way, @jpedro1992, you can evolve the API and controller at your own pace, and in this repo we focus on the scheduling portion for now, so that we won't be the blocker to this plugin's debut in scheduler-plugins. WDYT?

BTW: an example is NodeResourceTopology, which is hosted outside, but that doesn't mean we cannot build the scheduler plugin here.
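
As a sketch of what that split looks like in practice: the NodeResourceTopology API types are hosted out-of-tree, while the NodeResourceTopologyMatch plugin that consumes them is enabled here purely through the scheduler configuration (the exact apiVersion depends on the release in use):

```yaml
# Sketch: enabling a plugin whose API types live out-of-tree.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      filter:
        enabled:
          - name: NodeResourceTopologyMatch
      score:
        enabled:
          - name: NodeResourceTopologyMatch
```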

Huang-Wei avatar Sep 02 '22 18:09 Huang-Wei

Hi @ingvagabund and @Huang-Wei,

Thank you for the review! I will work on the suggested code modifications.

For me it's fine if we keep the CRDs + controllers outside of sig-scheduling for now. Is there any guide or example from another plugin I could follow to learn exactly how to host the components outside of sig-scheduling and build them in here?

I cannot simply remove the CRDs + controllers and add the plugins, since the plugins require the AppGroup CRD.

Thanks in advance!

jpedro1992 avatar Sep 05 '22 11:09 jpedro1992

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jpedro1992

Once this PR has been reviewed and has the lgtm label, please assign denkensk for approval by writing /assign @denkensk in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Sep 05 '22 13:09 k8s-ci-robot

Hi @ingvagabund and @Huang-Wei,

We decided to host the components in the following repo: Diktyo-io

Regarding the PR, do you prefer that we make the modifications in this PR (remove the CRDs + APIs and replace them with the plugins), or that we open a new PR with the plugins?

Thanks in advance!

jpedro1992 avatar Sep 07 '22 15:09 jpedro1992

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 07 '22 09:12 k8s-triage-robot

/close in favor of #432

Huang-Wei avatar Dec 08 '22 05:12 Huang-Wei

/close in favor of #432

I agree.

jpedro1992 avatar Dec 08 '22 09:12 jpedro1992