
Remapping process launch does not propagate extra-envs parameter

Open thvasilo opened this issue 1 year ago • 0 comments

I ran into this when running with a custom DGL installation.

With a training command such as:

python3 -m graphstorm.run.gs_node_classification \
        --extra-envs LD_LIBRARY_PATH="/opt/gs-venv/lib/python3.9/site-packages/dgl/:$LD_LIBRARY_PATH" \
        --num-trainers 1 \
        --num-servers 1 \
        --num-samplers 0 \
        --part-config /efs1/ogbn_arxiv_nc_train_val_4parts/ogbn-arxiv.json \
        --ip-config  /efs1/911734752298-us-east-1-4x-g5.8xlarge-ip_list.txt \
        --ssh-port 2222 \
        --cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \
        --save-perf-results-path /efs1/ogbn-arxiv-nc/models \
        --use-graphbolt False

The command launched on each worker includes the extra-envs:

ssh -o StrictHostKeyChecking=no -p 2222 172.31.66.83 'cd /; (export LD_LIBRARY_PATH=/opt/gs-venv/lib/python3.9/site-packages/dgl/:/opt/gs-venv/lib/python3.9/site-packages/dgl/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=/efs1/ogbn_arxiv_nc_train_val_4parts/ogbn-arxiv.json DGL_IP_CONFIG=/efs1/911734752298-us-east-1-4x-g5.8xlarge-ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=16 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python/:/root/dgl/python/:/root/dgl/tools/: ; /opt/gs-venv/bin/python3 -u -m torch.distributed.run --nproc_per_node=1 --nnodes=4 --node_rank=3 --master_addr=172.31.69.196 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py --cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml --save-perf-results-path /efs1/ogbn-arxiv-nc/models --ip-config /efs1/911734752298-us-east-1-4x-g5.8xlarge-ip_list.txt --part-config /efs1/ogbn_arxiv_nc_train_val_4parts/ogbn-arxiv.json)

However, the remap processes are launched without those env vars:

ssh -o StrictHostKeyChecking=no -p 2222 172.31.70.116 'cd /; (export PYTHONPATH=/graphstorm/python/:/root/dgl/python/:/root/dgl/tools/: ; /opt/gs-venv/bin/python3 -m graphstorm.gconstruct.remap_result --cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml --save-perf-results-path /efs1/ogbn-arxiv-nc/models --use-graphbolt False --ip-config /efs1/911734752298-us-east-1-4x-g5.8xlarge-ip_list.txt --part-config /efs1/ogbn_arxiv_nc_train_val_4parts/ogbn-arxiv.json --rank 1 --world-size 4 --with-shared-fs True --num-processes 1 --output-chunk-size 100000 --preserve-input False)

Since I rely on a custom DGL installation, the remap command above fails with:

RuntimeError: Cannot find the files.
List of candidates:
/root/dgl/python/dgl/libdgl.so
/root/dgl/build/libdgl.so
/root/dgl/build/Release/libdgl.so
/root/dgl/lib/libdgl.so
/root/libdgl.so

We should be propagating any extra-envs to the remap processes.
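
To make the proposal concrete, here is a rough sketch of the wrapping step, assuming the remap command is assembled as a plain string before being handed to ssh. The helper name `with_extra_envs` and its arguments are hypothetical and not taken from the GraphStorm code; the output only mirrors the `(export ...; cmd)` form seen in the worker command above.

```python
from typing import List


def with_extra_envs(cmd: str, extra_envs: List[str]) -> str:
    """Wrap `cmd` in a subshell that first exports each KEY=VALUE entry.

    Hypothetical helper for illustration; it is not part of GraphStorm.
    `extra_envs` is assumed to hold the values passed via --extra-envs.
    """
    if not extra_envs:
        # Nothing to propagate: leave the command untouched.
        return cmd

    for assignment in extra_envs:
        key, sep, _value = assignment.partition("=")
        if not sep or not key.isidentifier():
            raise ValueError(f"expected KEY=VALUE, got {assignment!r}")

    # Values are passed through as-is. A value containing spaces or quotes
    # would need extra care, since the launcher later wraps the whole
    # command in single quotes for ssh.
    exports = " ".join(extra_envs)
    return f"(export {exports}; {cmd})"


remap_cmd = "/opt/gs-venv/bin/python3 -m graphstorm.gconstruct.remap_result --rank 1"
wrapped = with_extra_envs(
    remap_cmd,
    ["LD_LIBRARY_PATH=/opt/gs-venv/lib/python3.9/site-packages/dgl/"],
)
print(wrapped)
```

Applying the same wrapping to both the training and the remap launch paths would keep the two environments consistent, so a custom DGL location only needs to be specified once.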

@classicsong

thvasilo, May 16 '24 21:05