What's the best practice for setting up a cluster?

ENV: Python 3.7.11, Mars 0.9.0
```python
import mars
import mars.tensor as mt

n = 20000
n_worker = 1
n_cpu = 1
mem_bytes = 20 * 2 ** 30

mars.new_session(init_local=True, n_worker=n_worker, n_cpu=n_cpu, mem_bytes=mem_bytes)

X = mt.random.RandomState(0).rand(n, n)
invX = mt.linalg.inv(X).execute()
```
While executing the above script, the execution speed is positively correlated with n_cpu and negatively correlated with n_worker.
Is this the expected result, or did I do something wrong?
If it is expected, does that mean I should always choose 1 worker with multiple CPUs, instead of multiple workers with 1 CPU each?
And what should I do if I can run on multiple machines, but each machine has only a few CPU cores? (A sketch of what I would try is shown below.)
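For the multi-machine case, this is roughly how I would expect to connect to an already-deployed cluster instead of starting a local one. This is only a sketch under my assumptions: the supervisor endpoint below is a placeholder, not a real address.

```python
import mars
import mars.tensor as mt

# Placeholder: use the web endpoint printed by your supervisor process.
endpoint = "http://supervisor-host:7103"

# Connect the session to the existing cluster rather than init_local=True.
mars.new_session(endpoint)

n = 20000
X = mt.random.RandomState(0).rand(n, n)
invX = mt.linalg.inv(X).execute()
```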
Samples: execution time in seconds.

| Worker \ CPU | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| 1 | 1100 | 546 | 291 | 184 | 124 |
| 2 | \ | 857 | 360 | 211 | 150 |
| 4 | \ | \ | 836 | 383 | 234 |
| 8 | \ | \ | \ | 417 | 310 |
| 16 | \ | \ | \ | \ | 555 |
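Each cell was measured with something like the following sketch (shown here for the 1-worker / 4-CPU configuration; this is illustrative rather than the exact harness, and absolute timings will vary by machine):

```python
import time

import mars
import mars.tensor as mt

n = 20000
# One configuration from the table: 1 worker, 4 CPUs, 20 GiB of memory.
mars.new_session(init_local=True, n_worker=1, n_cpu=4, mem_bytes=20 * 2 ** 30)

X = mt.random.RandomState(0).rand(n, n)
start = time.perf_counter()
mt.linalg.inv(X).execute()
print(f"inv took {time.perf_counter() - start:.1f} s")
```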
Nice question. The performance may be related to many factors; it seems the implementation of mt.linalg.inv does not scale well as the number of workers grows.
We need to do some investigation to see what is happening.
@qinxuye I also found very heavy network traffic. When doing mt.linalg.inv on a 20k square matrix (about 2.9 GiB), roughly 40-50 GiB of data is transmitted between nodes, across cluster sizes from 2 workers up to 16 workers.
I deployed the cluster on the Ray backend. As far as I know, Ray's own background network traffic is only about 10-20 KiB/s.
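For a rough sense of scale, here is the back-of-the-envelope arithmetic comparing the matrix size with the traffic I observed (the 45 GiB figure is just the midpoint of the 40-50 GiB range above):

```python
# A 20000 x 20000 float64 matrix.
matrix_bytes = 20000 ** 2 * 8            # 3.2e9 bytes
matrix_gib = matrix_bytes / 2 ** 30      # ~2.98 GiB, matching the ~2.9 GiB above

observed_traffic_gib = 45                # midpoint of the observed 40-50 GiB
ratio = observed_traffic_gib / matrix_gib
print(f"matrix ≈ {matrix_gib:.2f} GiB, traffic ≈ {ratio:.0f}x the matrix size")
# So the distributed inversion appears to move on the order of 15x the input
# between workers, which would explain why adding workers hurts more than it
# helps once the network becomes the bottleneck.
```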
@cmsxbc Do you mind joining our Slack (https://join.slack.com/t/mars-computing/shared_invite/zt-1c39tdh83-K1AT9FmtKkUgOzmM6~Nwbg) so that we can learn more about the problem you are solving?