extract_xgbooost_cluster_env() and xgb.rabit.get_rank() return different rank numbers

Open wulikai1993 opened this issue 4 years ago • 2 comments

I ran distributed training on k8s.

The rank number is obtained by extract_xgbooost_cluster_env(), as in https://github.com/kubeflow/xgboost-operator/blob/master/config/samples/xgboost-dist/train.py#L29

However, xgb.rabit.get_rank() returns a different rank number, as in https://github.com/kubeflow/xgboost-operator/blob/master/config/samples/xgboost-dist/train.py#L57.

Two things confuse me:

  1. Since extract_xgbooost_cluster_env() has already obtained the rank number, why call xgb.rabit.get_rank() to get the rank number again?
  2. Why are the two rank numbers different?

wulikai1993 avatar Nov 18 '20 07:11 wulikai1993

Exactly. I also see the same problem today.

(base) asaha-mbp151:maven asaha$ kubectl logs -f xgboost-asaha-rfw4as3le3u-master-0 -n asaha
starting the train job
starting to extract system env
extract the Rabit env from cluster : xgboost-asaha-rfw4as3le3u-master-0, port: 9991, rank: 0, world_size: 3 
start the master node
start listen on 0.0.0.0:9991
###### RabitTracker Setup Finished ######
##### Rabit rank setup with below envs #####
DMLC_NUM_WORKER=3
DMLC_TRACKER_URI=xgboost-asaha-rfw4as3le3u-master-0
DMLC_TRACKER_PORT=9991
DMLC_TASK_ID=0
worker(ip_address=10.46.85.245) connected!
worker(ip_address=10.46.95.126) connected!
##### Rabit rank = 1
@tracker All of 3 nodes getting started
worker(ip_address=10.44.239.26) connected!
Read data from IRIS data source with range from 50 to 100
starting to train xgboost at node with rank 1

@terrytangyuan if you could please comment.

asahalyft avatar Mar 16 '21 20:03 asahalyft

If you check the code, you will find that the second rank is used only on the worker node; it decides the Rabit worker id.
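
A toy sketch of how the two numbers can disagree. It does not use XGBoost or the real Rabit tracker. The helpers env_rank, tracker_ranks and shard, the pod names, and the "ranks are handed out in connection order" rule are all assumptions made for illustration. The only details taken from the log above are world_size 3 and the row range 50 to 100 read by the node that reports Rabit rank 1.

```python
# Toy model only: NOT the Rabit tracker, just an illustration of two
# independent rank sources. All names below are hypothetical.

def env_rank(pod_name):
    """Rank derived from the pod name (assumed scheme: master is 0,
    worker-N is N + 1), standing in for the env-based rank."""
    role, index = pod_name.rsplit("-", 2)[-2:]
    return 0 if role == "master" else int(index) + 1

def tracker_ranks(connect_order):
    """Assumption: the tracker hands out ranks in connection order,
    without looking at the env-based rank."""
    return {pod: rank for rank, pod in enumerate(connect_order)}

def shard(rank, world_size, n_rows=150):
    """Contiguous row range for a rank (150 rows, as in the Iris data)."""
    size = n_rows // world_size
    return rank * size, (rank + 1) * size

# Hypothetical connection order in which a worker reaches the tracker first.
connect_order = ["job-worker-0", "job-master-0", "job-worker-1"]
assigned = tracker_ranks(connect_order)

for pod in connect_order:
    print(pod, "env rank:", env_rank(pod),
          "tracker rank:", assigned[pod],
          "rows:", shard(assigned[pod], world_size=3))
```

In this sketch the master pod keeps env rank 0 but is assigned tracker rank 1, so it reads rows 50 to 100, which is the same pattern as in the log. Whichever rank is used to split the data, it needs to be the same one on every node so that shards do not overlap.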

merlintang avatar Mar 17 '21 06:03 merlintang