CloudShuffleService

cannot run css cluster

Open Lobo2008 opened this issue 2 years ago • 4 comments

  1. If I use all default settings and start with sbin/start-all.sh, a Worker and a Master are running, but when I submit a Spark app it throws: Caused by: java.lang.RuntimeException: replica num must less than worker num

  2. If I switch to zk mode by changing conf/css-default.conf to:

css.zookeeper.address=MyZkIP:2181
css.worker.registry.type=zookeeper

and start with sbin/start-workers.sh, sbin/start-worker.sh, or sbin/start-all.sh, it throws:

com.bytedance.css.service.deploy.worker.Worker --host xxx07v.xxxx.net
failed to launch: nice -n 0 /yy/java8/bin/java -Xmx1024m -XX:MaxDirectMemorySize=4096m -Dcss.log.dir=/home/aa/css/logs -Dcss.log.filename=css-aa-worker-1.out -classpath /yyy/java8/lib:/home/aa/css/lib/* com.bytedance.css.service.deploy.worker.Worker --host  xxx07v.xxxx.net
tail: cannot open '/home/aa/css/logs/css-aa-worker-1.out' for reading: No such file or directory
full log in /home/aa/css/logs/css-aa-worker-1.out
  3. If I deploy as described in README.md, with 3 workers in conf/workers plus zk mode, and run start-workers.sh, then enter the password for each of the 3 workers, it returns "permission denied". I am sure my password is correct.

Any suggestions? I think the README is ambiguous.

Lobo2008 avatar Sep 28 '22 04:09 Lobo2008

  1. CSS pushes data with two replicas, so at least two workers must be started.
  2. Does the directory /home/aa/css/logs exist? Or is there a permission issue on it?
  3. The machine that runs start-workers.sh needs passwordless SSH access to all of the workers.

bdyx123 avatar Sep 28 '22 08:09 bdyx123
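
The first point of the reply above can be illustrated with a minimal sketch. It is not CSS's actual code: the function names are hypothetical, and the rule itself (at least as many workers as replicas) is an assumption inferred from the reply, since the wording of the error message alone ("must less than") does not pin down the exact comparison.

```python
def check_replica_count(replica_num, worker_num):
    """Hypothetical stand-in for the check behind the error in this thread.

    Assumption: every replica must land on a different worker, so the
    cluster needs worker_num >= replica_num.
    """
    if worker_num < replica_num:
        # Message copied verbatim from the exception reported in the issue.
        raise RuntimeError("replica num must less than worker num")


def is_enough_workers(replica_num, worker_num):
    """Return True when the assumed check passes, False otherwise."""
    try:
        check_replica_count(replica_num, worker_num)
    except RuntimeError:
        return False
    return True


# Default start-all.sh in the report: one Worker, two replicas.
print(is_enough_workers(replica_num=2, worker_num=1))
# Two workers, as suggested in the reply.
print(is_enough_workers(replica_num=2, worker_num=2))
```

Under this assumption, the single Worker started by the default sbin/start-all.sh would not be enough for two replicas, which matches the first symptom in the report.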

There are no permission issues: with nothing changed, start-all.sh produces the corresponding master and worker logs, but when I switch to zk mode, it fails.

I have 3 nodes with IPs IP_A, IP_B, IP_C and want to use zk mode:

  • conf/workers
    IP_A
    IP_B
    IP_C
    
  • conf/css-default.conf
    css.cluster.name=bytedance
    css.commit.threads=128
    css.flush.timeout=360s
    css.network.timeout=600s
    css.disk.dir.num.min=5
    css.extMeta.expire.interval=600s
    css.zookeeper.address=My_Zk_IP:2181
    css.worker.registry.type=zookeeper
    css.cluster.name=bytedance
    

The directory is then copied to the 3 nodes. How should I change the other settings and run the scripts to make them work?

I suppose running start-workers.sh on one of the 3 nodes should work, with CSS checking the workers file to start all 3 workers. Or should I run start-worker.sh on every node so that each node starts its own worker process (in which case, should the other 2 IPs be deleted)?

Lobo2008 avatar Sep 28 '22 09:09 Lobo2008
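
The conf/css-default.conf posted above lists css.cluster.name twice. A small sketch for spotting such repeats before copying the directory to every node is given here. It assumes a plain key=value format with optional # comments, which is how the file appears in this thread; how CSS itself loads the file is not stated here, and parse_conf is a hypothetical helper.

```python
# Configuration as posted in the thread (My_Zk_IP is the poster's placeholder).
CONF_TEXT = """\
css.cluster.name=bytedance
css.commit.threads=128
css.flush.timeout=360s
css.network.timeout=600s
css.disk.dir.num.min=5
css.extMeta.expire.interval=600s
css.zookeeper.address=My_Zk_IP:2181
css.worker.registry.type=zookeeper
css.cluster.name=bytedance
"""


def parse_conf(text):
    """Parse key=value lines; return (settings, keys that appeared more than once)."""
    settings = {}
    duplicates = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key = key.strip()
        if key in settings:
            duplicates.append(key)
        settings[key] = value.strip()  # the last occurrence wins in this sketch
    return settings, duplicates


settings, duplicates = parse_conf(CONF_TEXT)
print("duplicate keys:", duplicates)
print("registry type:", settings["css.worker.registry.type"])
```

Here the repeated key carries the same value both times, so it is harmless, but the same check would catch two conflicting values.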

Yes, start-workers.sh should work. Haven't you started it?

bdyx123 avatar Oct 08 '22 10:10 bdyx123

My CSS cluster is now up and running in k8s with the zookeeper registry type. Everything looks fine until I run a Spark app. The application log shows the same error message:

java.lang.RuntimeException: replica num must less than worker num

The state in my ZooKeeper is:

[zk: localhost:2181(CONNECTED) 4] ls /css/my2css/workers
[css-0:39477:32875:35149, css-0:41865:46557:46199, css-0:43149:36579:33897, css-0:46573:36469:44793, css-0:46679:46533:41791, css-1:35421:36815:43879, css-1:39127:39883:44297, css-1:42185:42751:44815, css-1:43769:41983:33951]

The environment variable is set with export CSS_WORKER_INSTANCES=2. Could you please let us know which configuration sets the replica number and which one determines the worker number? Is there anything else I have missed?

a140262 avatar Oct 29 '22 00:10 a140262
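
As a way to read the ZooKeeper listing in the last comment, the sketch below counts the registered entries and groups them by host. It assumes the first colon-separated field of each entry is the host name and the remaining fields are ports; the thread does not document the entry format, and it does not say whether CSS compares the replica number against the number of entries, the number of distinct hosts, or something else.

```python
from collections import Counter

# Output of `ls /css/my2css/workers` as posted in the thread.
ZK_LISTING = (
    "[css-0:39477:32875:35149, css-0:41865:46557:46199, "
    "css-0:43149:36579:33897, css-0:46573:36469:44793, "
    "css-0:46679:46533:41791, css-1:35421:36815:43879, "
    "css-1:39127:39883:44297, css-1:42185:42751:44815, "
    "css-1:43769:41983:33951]"
)


def parse_workers(listing):
    """Split a zkCli-style '[a, b, c]' listing into individual entries."""
    body = listing.strip().strip("[]")
    return [entry.strip() for entry in body.split(",") if entry.strip()]


def entries_per_host(entries):
    """Count entries per host, assuming 'host:port:port:port' entries."""
    return Counter(entry.split(":", 1)[0] for entry in entries)


entries = parse_workers(ZK_LISTING)
per_host = entries_per_host(entries)
print("registered entries:", len(entries))
print("distinct hosts:", len(per_host))
print("entries per host:", dict(per_host))
```

With CSS_WORKER_INSTANCES=2, each host lists more entries than two, so checking which of these registrations belong to live worker processes may be a useful first step.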