
example failed: examples/tensorflow/criteo_deeprec/manual_job.yaml

Open jason-i-vv opened this issue 1 month ago • 0 comments

Environment

  1. Branch: master
  2. Kubernetes version: 1.22
  3. CUDA version: 12.3

Problem

I ran kubectl apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml, but the worker pods never appear; only a single master pod is running.

kubectl get po -n dlrover
NAME                                             READY   STATUS    RESTARTS   AGE
dlrover-controller-manager-85cb9778b-9sqb8       2/2     Running   0          5d16h
elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running   0          18h

Several thousand ScalePlan objects have been created:

dlrover     deepctr-manual-scale-scaleplan-986    Scaling     102m
dlrover     deepctr-manual-scale-scaleplan-987    Scaling     101m
dlrover     deepctr-manual-scale-scaleplan-988    Scaling     100m
dlrover     deepctr-manual-scale-scaleplan-989    Scaling     99m
dlrover     deepctr-manual-scale-scaleplan-99     Scaling     16h
dlrover     deepctr-manual-scale-scaleplan-990    Succeeded   98m
dlrover     deepctr-manual-scale-scaleplan-991    Succeeded   97m
dlrover     deepctr-manual-scale-scaleplan-992    Scaling     96m
dlrover     deepctr-manual-scale-scaleplan-993    Scaling     95m
dlrover     deepctr-manual-scale-scaleplan-994    Succeeded   94m
dlrover     deepctr-manual-scale-scaleplan-995    Succeeded   93m
dlrover     deepctr-manual-scale-scaleplan-996    Scaling     92m
dlrover     deepctr-manual-scale-scaleplan-997    Succeeded   91m
dlrover     deepctr-manual-scale-scaleplan-998    Succeeded   90m
dlrover     deepctr-manual-scale-scaleplan-999    Succeeded   89m
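
To see roughly how many plans sit in each phase, the listing above can be tallied with a short script. This is only a sketch: the helper name count_phases is hypothetical, and it assumes the four-column layout shown above (namespace, name, phase, age) with names containing "-scaleplan-".

```python
from collections import Counter


def count_phases(listing: str) -> Counter:
    """Tally ScalePlan rows by phase from kubectl-style text output.

    Assumes each row has four whitespace-separated columns:
    NAMESPACE NAME PHASE AGE (the layout of the listing in this issue).
    """
    phases = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and rows that are not ScalePlans.
        if len(fields) == 4 and "-scaleplan-" in fields[1]:
            phases[fields[2]] += 1
    return phases


# A few rows copied from the listing above.
sample = """\
dlrover     deepctr-manual-scale-scaleplan-989    Scaling     99m
dlrover     deepctr-manual-scale-scaleplan-990    Succeeded   98m
dlrover     deepctr-manual-scale-scaleplan-991    Succeeded   97m
"""

print(count_phases(sample))
```

Run against the full listing, this would show how many plans are still in Scaling versus Succeeded.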

Moreover, the spec of each of these ScalePlans is empty:

 kubectl describe scaleplans.elastic.iml.github.io -n dlrover deepctr-manual-scale-scaleplan-999
Name:         deepctr-manual-scale-scaleplan-999
Namespace:    dlrover
Labels:       scale-type=auto
Annotations:  <none>
API Version:  elastic.iml.github.io/v1alpha1
Kind:         ScalePlan
Metadata:
  Creation Timestamp:  2024-05-21T02:23:58Z
  Generation:          1
  Managed Fields:
    API Version:  elastic.iml.github.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:scale-type:
        f:ownerReferences:
          .:
          k:{"uid":"a7665789-d3cd-4b42-998b-35e12a7e8d8f"}:
      f:spec:
        .:
        f:createPods:
        f:ownerJob:
        f:psHosts:
        f:removePods:
        f:replicaResourceSpecs:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2024-05-21T02:23:58Z
    API Version:  elastic.iml.github.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:createTime:
        f:finishTime:
        f:phase:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2024-05-21T02:23:58Z
  Owner References:
    API Version:           elastic.iml.github.io/v1alpha1
    Block Owner Deletion:  true
    Kind:                  elasticjob
    Name:                  deepctr-manual-scale
    UID:                   a7665789-d3cd-4b42-998b-35e12a7e8d8f
  Resource Version:        43898773
  UID:                     8f451cf8-2d2c-4a8b-8208-b146187599e9
Spec:
  Create Pods:
  Owner Job:  deepctr-manual-scale
  Ps Hosts:
  Remove Pods:
  Replica Resource Specs:
Status:
  Create Time:  2024-05-21T02:23:58Z
  Finish Time:  2024-05-21T02:23:58Z
  Phase:        Succeeded
Events:         <none>

How can I verify elasticity for a TensorFlow job, whether manual or automatic?

jason-i-vv · May 21 '24 04:05