xgboost-operator
extract_xgbooost_cluster_env() and xgb.rabit.get_rank() return different rank numbers
I ran distributed training on k8s.
The rank number was obtained by extract_xgbooost_cluster_env(), as in https://github.com/kubeflow/xgboost-operator/blob/master/config/samples/xgboost-dist/train.py#L29
However, xgb.rabit.get_rank() returned a different rank number, as in https://github.com/kubeflow/xgboost-operator/blob/master/config/samples/xgboost-dist/train.py#L57.
Two things confuse me:
- Since extract_xgbooost_cluster_env() has already obtained the rank number, why use xgb.rabit.get_rank() to get the rank number again?
- Why are the two rank numbers different?
Exactly. I also see the same problem today.
(base) asaha-mbp151:maven asaha$ kubectl logs -f xgboost-asaha-rfw4as3le3u-master-0 -n asaha
starting the train job
starting to extract system env
extract the Rabit env from cluster : xgboost-asaha-rfw4as3le3u-master-0, port: 9991, rank: 0, world_size: 3
start the master node
start listen on 0.0.0.0:9991
###### RabitTracker Setup Finished ######
##### Rabit rank setup with below envs #####
DMLC_NUM_WORKER=3
DMLC_TRACKER_URI=xgboost-asaha-rfw4as3le3u-master-0
DMLC_TRACKER_PORT=9991
DMLC_TASK_ID=0
worker(ip_address=10.46.85.245) connected!
worker(ip_address=10.46.95.126) connected!
##### Rabit rank = 1
@tracker All of 3 nodes getting started
worker(ip_address=10.44.239.26) connected!
Read data from IRIS data source with range from 50 to 100
starting to train xgboost at node with rank 1
@terrytangyuan, could you please comment?
If you check the code, you will find that the second rank is used only on the worker nodes; it determines the Rabit worker ID.
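To make the distinction concrete, here is a minimal sketch of why the two numbers can disagree. It is not the code from train.py. It assumes the operator injects environment variables named MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE into each pod, and it uses a toy stand-in for the tracker that hands out worker ranks in sorted-host order. The function names, the IP addresses, and the assignment rule are all hypothetical; the real Rabit tracker may assign ranks differently.

```python
import os


def read_operator_env(environ=os.environ):
    """Read the cluster settings the operator is assumed to inject into a pod.

    The variable names are an assumption, not confirmed by this thread.
    """
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "rank": int(environ["RANK"]),
        "world_size": int(environ["WORLD_SIZE"]),
    }


def should_start_tracker(operator_rank):
    """The operator-assigned rank only decides which pod hosts the tracker."""
    return operator_rank == 0


def assign_tracker_ranks(connected_hosts):
    """Toy tracker: give out worker ranks in sorted-host order.

    Illustrative only. The point is that this assignment never looks at
    the operator's RANK variable, so the two numbers are independent.
    """
    return {host: rank for rank, host in enumerate(sorted(connected_hosts))}


# Made-up values for the master pod (operator rank 0).
master_env = {
    "MASTER_ADDR": "xgboost-master-0",
    "MASTER_PORT": "9991",
    "RANK": "0",
    "WORLD_SIZE": "3",
}
cfg = read_operator_env(master_env)

# Made-up pod IPs; the master pod also connects to the tracker as a worker.
master_ip = "10.0.0.7"
tracker_ranks = assign_tracker_ranks([master_ip, "10.0.0.3", "10.0.0.9"])

print("operator rank:", cfg["rank"])
print("starts tracker:", should_start_tracker(cfg["rank"]))
print("tracker-assigned worker rank:", tracker_ranks[master_ip])
```

In this toy run the master pod has operator rank 0 but receives worker rank 1 from the tracker, which mirrors the shape of the log above: the first rank picks the pod that starts the tracker, and the second identifies the worker within the training job (for example, to choose its data shard).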
On Wed, Mar 17, 2021 at 4:43 AM Anindya Saha @.***> wrote:
Exactly. I also see the same problem today.