What's the best practice for setting up a cluster?

ENV: Python 3.7.11, Mars 0.9.0
```python
import mars
import mars.tensor as mt

n = 20000
n_worker = 1
n_cpu = 1
mem_bytes = 20 * 2 ** 30

mars.new_session(init_local=True, n_worker=n_worker, n_cpu=n_cpu, mem_bytes=mem_bytes)

X = mt.random.RandomState(0).rand(n, n)
invX = mt.linalg.inv(X).execute()
```
While executing the above script, the execution speed is positively correlated with n_cpu and negatively correlated with n_worker.
Is this the expected result, or did I do something wrong?
If it is expected, does that mean I should always choose 1 worker with multiple CPUs, instead of multiple workers with 1 CPU each?
And what should I do if I can run on multiple machines, but each machine has only a few CPU cores? (A sketch of what I would try is shown below.)
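For the multi-machine case, this is roughly how I would expect to connect to an already-deployed cluster instead of starting a local one. This is only a sketch under my assumptions: the supervisor endpoint below is a placeholder, not a real address.

```python
import mars
import mars.tensor as mt

# Placeholder: use the web endpoint printed by your supervisor process.
endpoint = "http://supervisor-host:7103"

# Connect the session to the existing cluster rather than init_local=True.
mars.new_session(endpoint)

n = 20000
X = mt.random.RandomState(0).rand(n, n)
invX = mt.linalg.inv(X).execute()
```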
Samples: execution time in seconds.

| Worker \ CPU | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| 1 | 1100 | 546 | 291 | 184 | 124 |
| 2 | \ | 857 | 360 | 211 | 150 |
| 4 | \ | \ | 836 | 383 | 234 |
| 8 | \ | \ | \ | 417 | 310 |
| 16 | \ | \ | \ | \ | 555 |
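Each cell was measured with something like the following sketch (shown here for the 1-worker / 4-CPU configuration; this is illustrative rather than the exact harness, and absolute timings will vary by machine):

```python
import time

import mars
import mars.tensor as mt

n = 20000
# One configuration from the table: 1 worker, 4 CPUs, 20 GiB of memory.
mars.new_session(init_local=True, n_worker=1, n_cpu=4, mem_bytes=20 * 2 ** 30)

X = mt.random.RandomState(0).rand(n, n)
start = time.perf_counter()
mt.linalg.inv(X).execute()
print(f"inv took {time.perf_counter() - start:.1f} s")
```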
Nice question. The performance may be related to many factors; it seems the implementation of mt.linalg.inv does not scale well as the number of workers grows.
We need to do some investigation to see what is happening.
@qinxuye I also found very heavy network traffic. When doing mt.linalg.inv on a 20k square matrix (about 2.9 GiB), roughly 40-50 GiB of data is transmitted between nodes, across cluster sizes from 2 workers up to 16 workers.
I deployed the cluster on the Ray backend. As far as I know, Ray's own background network traffic is only about 10-20 KiB/s.
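For a rough sense of scale, here is the back-of-the-envelope arithmetic comparing the matrix size with the traffic I observed (the 45 GiB figure is just the midpoint of the 40-50 GiB range above):

```python
# A 20000 x 20000 float64 matrix.
matrix_bytes = 20000 ** 2 * 8            # 3.2e9 bytes
matrix_gib = matrix_bytes / 2 ** 30      # ~2.98 GiB, matching the ~2.9 GiB above

observed_traffic_gib = 45                # midpoint of the observed 40-50 GiB
ratio = observed_traffic_gib / matrix_gib
print(f"matrix ≈ {matrix_gib:.2f} GiB, traffic ≈ {ratio:.0f}x the matrix size")
# So the distributed inversion appears to move on the order of 15x the input
# between workers, which would explain why adding workers hurts more than it
# helps once the network becomes the bottleneck.
```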
@cmsxbc Do you mind joining our Slack (https://join.slack.com/t/mars-computing/shared_invite/zt-1c39tdh83-K1AT9FmtKkUgOzmM6~Nwbg) so that we can learn more about the problem you are solving?