dlrover
example failed: examples/tensorflow/criteo_deeprec/manual_job.yaml
Environment
- Branch: master
- Kubernetes version: 1.22
- CUDA version: 12.3
Problem
I ran kubectl apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml, but the worker nodes never appeared; only a single master pod is running:
kubectl get po -n dlrover
NAME                                             READY   STATUS    RESTARTS   AGE
dlrover-controller-manager-85cb9778b-9sqb8       2/2     Running   0          5d16h
elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running   0          18h
Several thousand ScalePlan objects have been created:
dlrover deepctr-manual-scale-scaleplan-986 Scaling 102m
dlrover deepctr-manual-scale-scaleplan-987 Scaling 101m
dlrover deepctr-manual-scale-scaleplan-988 Scaling 100m
dlrover deepctr-manual-scale-scaleplan-989 Scaling 99m
dlrover deepctr-manual-scale-scaleplan-99 Scaling 16h
dlrover deepctr-manual-scale-scaleplan-990 Succeeded 98m
dlrover deepctr-manual-scale-scaleplan-991 Succeeded 97m
dlrover deepctr-manual-scale-scaleplan-992 Scaling 96m
dlrover deepctr-manual-scale-scaleplan-993 Scaling 95m
dlrover deepctr-manual-scale-scaleplan-994 Succeeded 94m
dlrover deepctr-manual-scale-scaleplan-995 Succeeded 93m
dlrover deepctr-manual-scale-scaleplan-996 Scaling 92m
dlrover deepctr-manual-scale-scaleplan-997 Succeeded 91m
dlrover deepctr-manual-scale-scaleplan-998 Succeeded 90m
dlrover deepctr-manual-scale-scaleplan-999 Succeeded 89m
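To see how many plans are stuck in Scaling versus Succeeded, a listing like the one above can be tallied. The following is only a minimal sketch in Python. It assumes the listing has been saved as text with the four whitespace-separated columns shown (namespace, name, phase, age). The function name count_phases is hypothetical and not part of dlrover.

```python
from collections import Counter


def count_phases(listing: str) -> Counter:
    """Count ScalePlan rows per phase in a whitespace-separated listing."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Assumed row layout: NAMESPACE NAME PHASE AGE.
        # Skip a header row and anything that does not have four columns.
        if len(fields) != 4 or fields[0] == "NAMESPACE":
            continue
        counts[fields[2]] += 1
    return counts


# A few rows copied from the listing in this report.
sample = """\
dlrover deepctr-manual-scale-scaleplan-989 Scaling 99m
dlrover deepctr-manual-scale-scaleplan-99 Scaling 16h
dlrover deepctr-manual-scale-scaleplan-990 Succeeded 98m
"""

for phase, n in sorted(count_phases(sample).items()):
    print(phase, n)
```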
The spec of every one of these ScalePlans is empty:
kubectl describe scaleplans.elastic.iml.github.io -n dlrover deepctr-manual-scale-scaleplan-999
Name: deepctr-manual-scale-scaleplan-999
Namespace: dlrover
Labels: scale-type=auto
Annotations: <none>
API Version: elastic.iml.github.io/v1alpha1
Kind: ScalePlan
Metadata:
Creation Timestamp: 2024-05-21T02:23:58Z
Generation: 1
Managed Fields:
API Version: elastic.iml.github.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:labels:
.:
f:scale-type:
f:ownerReferences:
.:
k:{"uid":"a7665789-d3cd-4b42-998b-35e12a7e8d8f"}:
f:spec:
.:
f:createPods:
f:ownerJob:
f:psHosts:
f:removePods:
f:replicaResourceSpecs:
Manager: OpenAPI-Generator
Operation: Update
Time: 2024-05-21T02:23:58Z
API Version: elastic.iml.github.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:createTime:
f:finishTime:
f:phase:
Manager: manager
Operation: Update
Subresource: status
Time: 2024-05-21T02:23:58Z
Owner References:
API Version: elastic.iml.github.io/v1alpha1
Block Owner Deletion: true
Kind: elasticjob
Name: deepctr-manual-scale
UID: a7665789-d3cd-4b42-998b-35e12a7e8d8f
Resource Version: 43898773
UID: 8f451cf8-2d2c-4a8b-8208-b146187599e9
Spec:
Create Pods:
Owner Job: deepctr-manual-scale
Ps Hosts:
Remove Pods:
Replica Resource Specs:
Status:
Create Time: 2024-05-21T02:23:58Z
Finish Time: 2024-05-21T02:23:58Z
Phase: Succeeded
Events: <none>
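The same emptiness check could be scripted across many ScalePlans instead of reading each describe output by hand. The sketch below assumes the object has been exported as JSON (for example with kubectl get ... -o json) and that the spec field names match those in the managed fields above (createPods, removePods, psHosts, replicaResourceSpecs). The value types in the sample and the helper names empty_spec_fields and is_noop are assumptions made for illustration.

```python
import json

# Spec fields that describe actual scaling work; names are taken from the
# managed fields in the describe output. ownerJob is excluded because it
# only identifies the owning job.
SPEC_FIELDS = ("createPods", "removePods", "psHosts", "replicaResourceSpecs")


def empty_spec_fields(scaleplan: dict) -> list:
    """Return the names of spec fields that are missing or empty."""
    spec = scaleplan.get("spec") or {}
    return [name for name in SPEC_FIELDS if not spec.get(name)]


def is_noop(scaleplan: dict) -> bool:
    """True if the plan asks for no pod or resource change at all."""
    return len(empty_spec_fields(scaleplan)) == len(SPEC_FIELDS)


# Hypothetical JSON mirroring the describe output above; the empty list and
# dict types are assumed, since describe only shows the fields as blank.
sample_json = """
{
  "kind": "ScalePlan",
  "metadata": {"name": "deepctr-manual-scale-scaleplan-999"},
  "spec": {
    "createPods": [],
    "ownerJob": "deepctr-manual-scale",
    "psHosts": [],
    "removePods": [],
    "replicaResourceSpecs": {}
  },
  "status": {"phase": "Succeeded"}
}
"""

plan = json.loads(sample_json)
print(plan["metadata"]["name"], "no-op:", is_noop(plan))
```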
How can I verify elasticity for a TensorFlow job, whether with manual or automatic scaling?