mars icon indicating copy to clipboard operation
mars copied to clipboard

What's best practice for setup cluster?

Open cmsxbc opened this issue 2 years ago • 3 comments

ENV: python 3.7.11 mars 0.9.0

import mars
import mars.tensor as mt

n = 20000
n_worker = 1
n_cpu = 1
mem_bytes = 20 * 2 ** 30

mars.new_session(init_local=True, n_worker=n_worker, n_cpu=n_cpu, mem_bytes=mem_bytes)

X = mt.random.RandomState(0).rand(n,n)
invX = mt.linalg.inv(X).execute()

While executing above script, the speed is positive correlate to n_cpu and negative correlate to n_worker. Is it right result or I did something wrong? If it was right, is that meaning I should always choose 1 worker with multi cpu , instead of multi worker with 1 cpu? And what should I do if i can run on multi machines, but only a little cpu core every machine.

samples: execute time in seconds.

Worker\CPU 1 2 4 8 16
1 1100 546 291 184 124
2 \ 857 360 211 150
4 \ \ 836 383 234
8 \ \ \ 417 310
16 \ \ \ \ 555

cmsxbc avatar Jul 05 '22 07:07 cmsxbc

Nice question, the performance may be related with many factors, seems the implementation of mt.linalg.inv does not scale well when workers grow.

We need to do some investigation to see what happened.

qinxuye avatar Jul 06 '22 02:07 qinxuye

@qinxuye And I found there were very busy network traffic. When doing mt.linalg.inv on a 20k square matrix (the size is about 2.9GiB), there will be about 40GiB - 50 GiB data be transmit between nodes, from 2 workers cluster to 16 workers cluster. I deployed the cluster over ray backend. As I known, the background network throughput of ray is about 10-20 KiB/s.

cmsxbc avatar Jul 07 '22 02:07 cmsxbc

@cmsxbc Do you mind to join slack https://join.slack.com/t/mars-computing/shared_invite/zt-1c39tdh83-K1AT9FmtKkUgOzmM6~Nwbg

So that we can know more info about the problem you are solving.

qinxuye avatar Jul 07 '22 04:07 qinxuye