Remapping process launch does not propagate extra-envs parameter
I ran into this while running with a custom DGL installation.
Using a training command such as:
python3 -m graphstorm.run.gs_node_classification \
--extra-envs LD_LIBRARY_PATH="/opt/gs-venv/lib/python3.9/site-packages/dgl/:$LD_LIBRARY_PATH" \
--num-trainers 1 \
--num-servers 1 \
--num-samplers 0 \
--part-config /efs1/ogbn_arxiv_nc_train_val_4parts/ogbn-arxiv.json \
--ip-config /efs1/911734752298-us-east-1-4x-g5.8xlarge-ip_list.txt \
--ssh-port 2222 \
--cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \
--save-perf-results-path /efs1/ogbn-arxiv-nc/models \
--use-graphbolt False
The commands launched in each worker will include the extra-envs:
ssh -o StrictHostKeyChecking=no -p 2222 172.31.66.83 'cd /; (export LD_LIBRARY_PATH=/opt/gs-venv/lib/python3.9/site-packages/dgl/:/opt/gs-venv/lib/python3.9/site-packages/dgl/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=/efs1/ogbn_arxiv_nc_train_val_4parts/ogbn-arxiv.json DGL_IP_CONFIG=/efs1/911734752298-us-east-1-4x-g5.8xlarge-ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=16 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python/:/root/dgl/python/:/root/dgl/tools/: ; /opt/gs-venv/bin/python3 -u -m torch.distributed.run --nproc_per_node=1 --nnodes=4 --node_rank=3 --master_addr=172.31.69.196 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py --cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml --save-perf-results-path /efs1/ogbn-arxiv-nc/models --ip-config /efs1/911734752298-us-east-1-4x-g5.8xlarge-ip_list.txt --part-config /efs1/ogbn_arxiv_nc_train_val_4parts/ogbn-arxiv.json)
However, the launched remap processes do not propagate the same env vars:
ssh -o StrictHostKeyChecking=no -p 2222 172.31.70.116 'cd /; (export PYTHONPATH=/graphstorm/python/:/root/dgl/python/:/root/dgl/tools/: ; /opt/gs-venv/bin/python3 -m graphstorm.gconstruct.remap_result --cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml --save-perf-results-path /efs1/ogbn-arxiv-nc/models --use-graphbolt False --ip-config /efs1/911734752298-us-east-1-4x-g5.8xlarge-ip_list.txt --part-config /efs1/ogbn_arxiv_nc_train_val_4parts/ogbn-arxiv.json --rank 1 --world-size 4 --with-shared-fs True --num-processes 1 --output-chunk-size 100000 --preserve-input False)
Since I rely on a custom DGL installation, the remap command above fails with:
RuntimeError: Cannot find the files.
List of candidates:
/root/dgl/python/dgl/libdgl.so
/root/dgl/build/libdgl.so
/root/dgl/build/Release/libdgl.so
/root/dgl/lib/libdgl.so
/root/libdgl.so
We should be propagating any extra-envs to the remap processes.
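A rough sketch of what the fix could look like, assuming the launcher already holds the `--extra-envs` values as a list of `KEY=VALUE` strings. `wrap_with_extra_envs` is a hypothetical helper for illustration, not an existing GraphStorm function, and the real launch code may build its commands differently:

```python
import shlex


def wrap_with_extra_envs(cmd, extra_envs):
    """Prefix a remote shell command with an `export` of each KEY=VALUE entry.

    Hypothetical helper: `extra_envs` is assumed to be the list of strings
    passed via --extra-envs, e.g. ["LD_LIBRARY_PATH=/opt/dgl/:/usr/lib"].
    """
    if not extra_envs:
        return cmd
    exports = []
    for entry in extra_envs:
        key, sep, value = entry.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {entry!r}")
        # Quote the value so spaces or shell metacharacters survive the ssh hop.
        exports.append(f"{key}={shlex.quote(value)}")
    return f"(export {' '.join(exports)}; {cmd})"


# The remap command would be wrapped the same way as the training command.
remap_cmd = "python3 -m graphstorm.gconstruct.remap_result --rank 1"
wrapped = wrap_with_extra_envs(remap_cmd, ["LD_LIBRARY_PATH=/opt/dgl/:/usr/lib"])
```

Applying the same wrapping to both the training and remap launches would keep the two environments consistent.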
@classicsong