[BUG] Distributed training failed on Ray cluster
Describe the bug
Mars integrates with deep learning frameworks such as PyTorch and TensorFlow. These frameworks usually require certain environment variables to be set for distributed training, e.g. TF_CONFIG for TensorFlow and MASTER_ADDR for PyTorch. We use ctx.get_worker_addresses() to collect all worker addresses. This works well for the Oscar backend, but on Ray the returned addresses start with ray://, which these frameworks cannot use. We need a way to obtain each worker's host IP, not its Ray address, to resolve this issue.
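For context, here is a minimal sketch of how the worker addresses are typically turned into PyTorch's distributed-training environment variables, and where a ray:// address breaks down. Only ctx.get_worker_addresses() comes from Mars; the helper name, the port, and the example address formats are illustrative assumptions, not actual Mars API.

```python
import os


def setup_pytorch_env(ctx, rank: int, port: int = 29500):
    """Hypothetical helper: derive MASTER_ADDR/MASTER_PORT from Mars worker addresses."""
    addrs = ctx.get_worker_addresses()
    # Oscar backend: addresses look like "10.0.0.3:12345", so the host part
    # is a usable IP for MASTER_ADDR.
    # Ray backend: addresses look like "ray://<cluster>/0/1" (illustrative),
    # so there is no host IP at all and PyTorch cannot connect to it.
    master = addrs[0]
    host = master.split("://")[-1].split(":")[0]
    os.environ["MASTER_ADDR"] = host          # invalid when host is a Ray actor path
    os.environ["MASTER_PORT"] = str(port)
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(len(addrs))
```

A method that returns the workers' host IPs directly would let this kind of setup work the same way on both backends.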