graphstorm
[Gpartition] Process failures not recognized between process launches.
Observed the following today:
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.93.146 '(export DGL_IP_CONFIG=/ip_list.txt DGL_NUM_SERVER=1 PYTHONPATH=/graphstorm/python/:/root/dgl/tools/: RANK=16 MASTER_ADDR=172.31.95.143 MASTER_PORT=12345; /opt/gs-venv/bin/python /root/dgl/tools/distpartitioning/data_proc_pipeline.py --world-size 20 --partitions-dir /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/partition_assignment --input-dir /mount/gsprocessing/spear-local-graph-04-23 --graph-name spear-global-graph-0423-features --schema updated_row_counts_metadata.json --num-parts 20 --output /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/dist_graph --process-group-timeout 1800 --log-level INFO --save-orig-nids --save-orig-eids )'' returned non-zero exit status 1.
cleanupu process runs
/opt/gs-venv/bin/python3 /root/dgl/tools/distgraphlaunch.py --ssh_port 2222 --num_proc_per_machine 1 --ip_config /ip_list.txt --master_port 12345 "/opt/gs-venv/bin/python /root/dgl/tools/distpartitioning/data_proc_pipeline.py --world-size 20 --partitions-dir /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/partition_assignment --input-dir /mount/gsprocessing/spear-local-graph-04-23 --graph-name spear-global-graph-0423-features --schema updated_row_counts_metadata.json --num-parts 20 --output /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/dist_graph --process-group-timeout 1800 --log-level INFO --save-orig-nids --save-orig-eids "
INFO:root:DGL graph building took -3983.498513 sec
INFO:root:Copying raw_id_mappings to dist_graph
INFO:root:Partition assignment and DGL graph creation took 4722.130789 seconds
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable. Valid options are: pytorch, mxnet, tensorflow (all lowercase)
Here the /opt/gs-venv/bin/python /root/dgl/tools/distpartitioning/data_proc_pipeline.py process failed with exit status 1, but the failure was not detected: the parent process exited with a zero exit code, and the job reported success instead of failure.
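To make the expected behavior concrete, here is a minimal sketch of a launcher that records each child's exit code and surfaces any failure to the caller. It assumes the launcher runs each remote command in its own worker thread, where an uncaught exception such as subprocess.CalledProcessError would otherwise end only that thread. I have not confirmed that this is how distgraphlaunch.py is structured. The names run_all and first_failure are hypothetical and are not part of DGL or GraphStorm.

```python
import subprocess
import threading


def run_all(commands):
    """Run each command (an argv list) in its own thread; return all exit codes."""
    exit_codes = [None] * len(commands)

    def worker(idx, argv):
        # Record the return code explicitly. An exception raised inside a
        # thread does not propagate to the main thread, so we must not rely
        # on check=True here to signal failure.
        try:
            exit_codes[idx] = subprocess.run(argv, check=False).returncode
        except OSError:
            # The command could not be started at all; treat it as a failure.
            exit_codes[idx] = 1

    threads = [
        threading.Thread(target=worker, args=(i, argv))
        for i, argv in enumerate(commands)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return exit_codes


def first_failure(exit_codes):
    """Return the first non-zero exit code, or 0 if every process succeeded."""
    for code in exit_codes:
        if code != 0:
            # A negative code means the child was killed by a signal;
            # map it to a plain failure status.
            return code if code > 0 else 1
    return 0
```

The launcher's entry point would then end with something like sys.exit(first_failure(run_all(commands))), so that a single failed rank makes the whole job exit non-zero.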
I recall that we have had trouble detecting process failures in general, so I am adding this issue to track the problem. We can add related failures here.