TensorFlow distributed MNIST training problem
What happened:
I was trying to run tf-dist-mnist-example.yaml to see how it works on a real k8s cluster. However, the worker nodes keep waiting for a reply and never receive the ps node's information.
I tested it with volcano-1.5.1 and it works on k8s. But when I run the same example with volcano-1.6.0 on the cluster, the problem above appears.
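A possible way to narrow this down is to check, from inside a worker pod, whether every address in the cluster spec resolves and accepts TCP connections. The sketch below assumes the example hands the cluster layout to TensorFlow through the `TF_CONFIG` environment variable, which has not been confirmed for this job; the helper names `parse_cluster` and `is_reachable` are made up for illustration, and addresses are assumed to have the form `host:port`.

```python
import json
import os
import socket


def parse_cluster(tf_config: str):
    """Return (job, index, host, port) for every address in a TF_CONFIG JSON string."""
    cluster = json.loads(tf_config).get("cluster", {})
    endpoints = []
    for job, addresses in sorted(cluster.items()):
        for index, address in enumerate(addresses):
            # Assumption: each address looks like "host:port".
            host, _, port = address.rpartition(":")
            endpoints.append((job, index, host, int(port)))
    return endpoints


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if the name resolves and a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def main() -> None:
    tf_config = os.environ.get("TF_CONFIG")
    if not tf_config:
        print("TF_CONFIG is not set in this container")
        return
    for job, index, host, port in parse_cluster(tf_config):
        status = "ok" if is_reachable(host, port) else "UNREACHABLE"
        print(f"{job}[{index}] {host}:{port} -> {status}")


if __name__ == "__main__":
    main()
```

If the ps address shows up as unreachable from a worker on 1.6.0 but not on 1.5.1, that would point at name resolution or networking between the pods rather than at the training script itself.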
Environment:
- Volcano Version: 1.6.0
Logs from the ps node:
Logs from the worker node:
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Received, thank you!
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗