[BUG] Distributed training failed on Ray cluster
Describe the bug
Mars integrates with deep learning frameworks such as PyTorch and TensorFlow. These frameworks usually require certain environment variables to be set for distributed training, e.g. TF_CONFIG for TensorFlow and MASTER_ADDR for PyTorch. We use ctx.get_worker_addresses() to collect all worker addresses. This works well for the Oscar backend, but on Ray the returned addresses start with ray://, which these frameworks cannot use. We need a way to obtain each worker's host IP, not its Ray address, to resolve this issue.
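For context, here is a minimal sketch of how the worker addresses are typically turned into PyTorch's distributed-training environment variables, and where a ray:// address breaks down. Only ctx.get_worker_addresses() comes from Mars; the helper name, the port, and the example address formats are illustrative assumptions, not actual Mars API.

```python
import os


def setup_pytorch_env(ctx, rank: int, port: int = 29500):
    """Hypothetical helper: derive MASTER_ADDR/MASTER_PORT from Mars worker addresses."""
    addrs = ctx.get_worker_addresses()
    # Oscar backend: addresses look like "10.0.0.3:12345", so the host part
    # is a usable IP for MASTER_ADDR.
    # Ray backend: addresses look like "ray://<cluster>/0/1" (illustrative),
    # so there is no host IP at all and PyTorch cannot connect to it.
    master = addrs[0]
    host = master.split("://")[-1].split(":")[0]
    os.environ["MASTER_ADDR"] = host          # invalid when host is a Ray actor path
    os.environ["MASTER_PORT"] = str(port)
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(len(addrs))
```

A method that returns the workers' host IPs directly would let this kind of setup work the same way on both backends.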