Default k8s scheduler support

Open JosephKang opened this issue 3 years ago • 5 comments

Organization Name: Advantech

Short summary about the issue/question: Is the default k8s scheduler supported in OpenPAI v1.5.0? How can CPU and GPU tasks run on a single GPU worker? (E.g. https://github.com/microsoft/pai/issues/5044)

Brief what process you are following: In v1.0.1, we could use the k8s default scheduler based on https://github.com/microsoft/pai/issues/5044#issuecomment-720410187. When I switch to the k8s default scheduler, the SKU-based scheduling seems incorrect. Is the default k8s scheduler supported in OpenPAI v1.5.0, or will it be in a future release?

How to reproduce it:

  1. Deploy OpenPAI v1.5.0
  2. Change the scheduler as follows
hivedscheduler:
  config: |
  3. Apply the configuration
     ./paictl.py service stop -n rest-server hivedscheduler
     ./paictl.py config push -p -m service
     ./paictl.py service start -n hivedscheduler rest-server
  4. The SKU on the job submission seems incorrect undef_sku

OpenPAI Environment:

  • OpenPAI version: v1.5.0
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.3 LTS

JosephKang avatar Apr 08 '21 03:04 JosephKang

Not sure if we still support default scheduler. @abuccts to help

yqwang-ms avatar Apr 08 '21 07:04 yqwang-ms

@JosephKang , could you describe the detailed scheduling behavior? We didn't test the default scheduler for a while.

fanyangCS avatar Apr 09 '21 07:04 fanyangCS

The default scheduler is used to allocate job resources on demand instead of in SKU units, and it might achieve the maximum utilization of the worker node.

The following scenarios might be a good example for one worker with 1 GPU/9 CPU resources. Please let me know if my understanding is incorrect.
Scenario a. One 1GPU/4CPU task and one 4CPU task at the same time.
Scenario b. Two 4CPU tasks at the same time.
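To make the two scenarios concrete, here is a minimal sketch of the per-resource check they rely on, assuming resources are counted per request rather than per SKU. The `fits` helper and the dictionary keys are hypothetical names used only for illustration; they are not part of OpenPAI or Kubernetes.

```python
# Hypothetical helper (not an OpenPAI or Kubernetes API): checks whether a set
# of task requests fits on one node when each resource is counted per request.
def fits(node, tasks):
    for resource, capacity in node.items():
        # Sum what all tasks ask for; a task that omits a resource asks for 0.
        requested = sum(task.get(resource, 0) for task in tasks)
        if requested > capacity:
            return False
    return True


# One worker with 1 GPU and 9 CPUs, as in the example above.
node = {"gpu": 1, "cpu": 9}

# Scenario a: one 1GPU/4CPU task and one 4CPU task.
scenario_a = [{"gpu": 1, "cpu": 4}, {"cpu": 4}]
# Scenario b: two 4CPU tasks.
scenario_b = [{"cpu": 4}, {"cpu": 4}]

print(fits(node, scenario_a))  # True: 1 GPU and 8 CPUs requested
print(fits(node, scenario_b))  # True: 0 GPUs and 8 CPUs requested
```

Under this model both scenarios fit on the single worker, which is the utilization the request-based approach is hoping for.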

JosephKang avatar Apr 11 '21 15:04 JosephKang

It seems you are asking whether the webportal is allowed to assign a fraction of resources other than the defined SKU? And yes, we prefer users to use resources at the granularity of the SKU to avoid unnecessary fragmentation (so in the webportal you cannot set resources other than by SKU). If you want more fine-grained resource usage, you can specify the resource usage through the OpenPAI SDK.

fanyangCS avatar Apr 12 '21 06:04 fanyangCS

We hope to have more fine-grained resource usage. It seems that the resources of the assigned task pod can be set from the API parameters instead of in SKU units, but the deduction from the available resources still seems to be based on the granularity of the SKU.

E.g.
Total resource      = 2GPU/8CPU/50G RAM
SKU                 = 1GPU/4CPU/25G RAM
Request API         =      2CPU/20G RAM
Remaining available = 1GPU/4CPU/25G RAM (1 SKU left)

Is this also the preferred behavior, in order to stay in sync with the SKU?
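The arithmetic in the example can be sketched as follows. This is only one possible accounting model that happens to reproduce the numbers above (round each request up to whole SKUs); it is an assumption, not a description of how the OpenPAI scheduler actually deducts resources. All function and key names are hypothetical.

```python
import math


# Assumed model: a request consumes as many whole SKUs as its largest
# per-resource demand requires (rounded up).
def skus_needed(request, sku):
    return max(math.ceil(request.get(r, 0) / amount) for r, amount in sku.items())


# Remaining resources when deduction happens in whole SKUs.
def remaining_by_sku(total, sku, request):
    total_skus = min(total[r] // sku[r] for r in sku)
    left = total_skus - skus_needed(request, sku)
    return {r: amount * left for r, amount in sku.items()}


# Remaining resources when deduction matches the request exactly.
def remaining_exact(total, request):
    return {r: total[r] - request.get(r, 0) for r in total}


total = {"gpu": 2, "cpu": 8, "ram_g": 50}
sku = {"gpu": 1, "cpu": 4, "ram_g": 25}
request = {"cpu": 2, "ram_g": 20}

print(remaining_by_sku(total, sku, request))  # {'gpu': 1, 'cpu': 4, 'ram_g': 25}
print(remaining_exact(total, request))        # {'gpu': 2, 'cpu': 6, 'ram_g': 30}
```

The gap between the two results (1 GPU, 2 CPUs and 5G RAM) is the capacity that becomes unavailable when a sub-SKU request is charged as a full SKU.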

JosephKang avatar Apr 14 '21 06:04 JosephKang