TensorFlow distributed MNIST training problem

Open Trainbow opened this issue 2 years ago • 1 comments

What happened: I tried to run tf-dist-mnist-example.yaml to see how it works on a real k8s cluster. The worker nodes wait for a reply and never receive the ps node's information.

I tested it with volcano-1.5.1 and it works on k8s. But when I run the same example with volcano-1.6.0 on the cluster, the problem above appears.

Environment:

  • Volcano Version: 1.6.0

Trainbow avatar Aug 23 '22 11:08 Trainbow

Logs from the ps node: image

Logs from the worker node: image

Trainbow avatar Aug 23 '22 11:08 Trainbow

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Nov 22 '22 20:11 stale[bot]

Received, thank you!

Trainbow avatar Nov 22 '22 20:11 Trainbow

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Mar 18 '23 20:03 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar May 18 '23 22:05 stale[bot]