CloudShuffleService
cannot run css cluster
- if I use all default settings and start with sbin/start-all.sh, a Worker and a Master are running, but when I submit a Spark app it throws:
Caused by: java.lang.RuntimeException: replica num must less than worker num
- if I run zk mode by changing conf/css-default.cnf as:
css.zookeeper.address=MyZkIP:2181
css.worker.registry.type=zookeeper
and start with sbin/start-workers.sh, or sbin/start-worker.sh, or sbin/start-all.sh, it throws the following (a log-directory check is sketched below):
com.bytedance.css.service.deploy.worker.Worker --host xxx07v.xxxx.net
failed to launch: nice -n 0 /yy/java8/bin/java -Xmx1024m -XX:MaxDirectMemorySize=4096m -Dcss.log.dir=/home/aa/css/logs -Dcss.log.filename=css-aa-worker-1.out -classpath /yyy/java8/lib:/home/aa/css/lib/* com.bytedance.css.service.deploy.worker.Worker --host xxx07v.xxxx.net
tail: cannot open '/home/aa/css/logs/css-aa-worker-1.out' for reading: No such file or directory
full log in /home/aa/css/logs/css-aa-worker-1.out
- if I deploy as the README.md describes, set 3 workers in conf/workers plus zk mode, and run start-workers.sh, then after entering my password for the 3 workers it returns permission denied. I am sure my password is correct.
Any suggestions? I think the README is ambiguous.
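For the second scenario, a check worth running before restarting anything: the tail error means the worker's log file was never created, which usually points at the log directory itself being missing or not writable. A minimal sketch, using only the paths taken from the launch command quoted above:

# Verify the log directory from -Dcss.log.dir exists and is writable; create it if not.
ls -ld /home/aa/css/logs || mkdir -p /home/aa/css/logs
# Retry a single worker and read its launch output directly.
sbin/start-worker.sh
tail -n 100 /home/aa/css/logs/css-aa-worker-1.out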
- CSS pushes data with two replicas, so you should start at least two workers
- does the dir /home/aa/css/logs exist? or are there permission issues on it?
- the machine that executes start-workers.sh should have passwordless SSH set up to all of the workers (see the sketch below)
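A minimal sketch of the passwordless-SSH setup mentioned in the last point, run on the node that executes start-workers.sh; worker1, worker2 and worker3 are placeholders for whatever hosts are listed in conf/workers:

# Generate a key pair once (skip if ~/.ssh/id_rsa already exists).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Copy the public key to every host in conf/workers (placeholder host names).
for h in worker1 worker2 worker3; do ssh-copy-id "$h"; done
# Confirm SSH no longer asks for a password.
ssh worker1 true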
There are no permission issues: without changing anything, starting with start-all.sh produces the corresponding master and worker logs, but when I switch to zk mode it fails.
I have 3 nodes with IPs IP_A, IP_B, IP_C and want to use zk mode.
- conf/workers:
IP_A
IP_B
IP_C
- conf/css-default.conf:
css.cluster.name=bytedance
css.commit.threads=128
css.flush.timeout=360s
css.network.timeout=600s
css.disk.dir.num.min=5
css.extMeta.expire.interval=600s
css.zookeeper.address=My_Zk_IP:2181
css.worker.registry.type=zookeeper
css.cluster.name=bytedance
Then the dir is sent to the 3 nodes. How should I change the other settings and run the scripts to make them work?
I suppose running start-workers.sh on one of the 3 nodes should work: CSS will check the workers file to start all 3 workers.
OR run start-worker.sh on every node, and each node will start its own worker process (at this point, should the other 2 IPs be deleted?)
yes, start-workers.sh can work, haven't you started it?
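For reference, a sketch of that flow on the three nodes above, using the files and scripts already mentioned in this thread and assuming conf/workers takes one host per line; the same CSS dir, with the zookeeper settings in the conf file, is assumed to already be copied to all three nodes:

# conf/workers with one host per line (IP_A, IP_B, IP_C as above).
printf '%s\n' IP_A IP_B IP_C > conf/workers
# Run once, on any one of the three nodes; it connects over SSH to every host
# listed in conf/workers, which is why passwordless SSH is needed.
sbin/start-workers.sh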
My CSS cluster is up and running with the zookeeper registry type in k8s now. Everything looks fine until I run a Spark app. The application log shows the same error message:
java.lang.RuntimeException: replica num must less than worker num
The state in my zookeeper is:
[zk: localhost:2181(CONNECTED) 4] ls /css/my2css/workers
[css-0:39477:32875:35149, css-0:41865:46557:46199, css-0:43149:36579:33897, css-0:46573:36469:44793, css-0:46679:46533:41791, css-1:35421:36815:43879, css-1:39127:39883:44297, css-1:42185:42751:44815, css-1:43769:41983:33951]
The environment variable has been set: export CSS_WORKER_INSTANCES=2
Could you please let us know which configuration sets the replica number and which one determines the worker number? Is there anything else I have missed?
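It's not clear from this thread which property controls the replica count, but the registry side can at least be counted directly. A sketch, assuming a ZooKeeper 3.5+ zkCli.sh that accepts the command inline, against the same path queried above:

# Count the worker entries registered under the CSS cluster path.
bin/zkCli.sh -server localhost:2181 ls /css/my2css/workers 2>/dev/null | tail -n 1 | tr ',' '\n' | wc -l
# If some of these entries are stale registrations left over from restarted pods,
# the number of live workers may be smaller than the listing suggests.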