Sagemaker distributed training data parallelism notebook does not work - "Orted process exited"
I am trying out the SageMaker notebooks on AWS. The third notebook, distributed training data parallelism, does not work. I believe the problem is that the training process on the second instance doesn't start up correctly, but I'm not sure. Here are the steps to reproduce:
- In AWS console, go to: Amazon SageMaker > Notebook instances > Create notebook instance
- set name: aa-huggingface-test
- Leave default settings EXCEPT:
  - create new role (defaults except NONE for "S3 buckets you specify")
  - clone public github repository: https://github.com/huggingface/notebooks
- Click Create
- Wait for status = "InService"
- Click "Open Jupyter"
- In Jupyter navigate to notebooks > sagemaker > 03_distributed_training_data_parallelism
- Popup for "Kernel not found": Could not find a kernel matching Python 3.8.5 64-bit ('hf': conda). Please select a kernel
- Choose "conda_pytorch_p39"
- Click "Set kernel"
- Wait for kernel to start and connect
- Execute cells up to and including:
huggingface_estimator.fit()
This launches a SageMaker training job with two instances. The first instance emits a LOT of log messages; I have trimmed its log to just the "error" entries toward the end. I am also including the full log for the second instance. The final log message in that one is: "Orted process exited." instance1.txt instance2.txt
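For context, here is roughly what the notebook's estimator setup looks like before the fit() call above. This is a sketch, not the notebook verbatim: role is the SageMaker execution role, the version pins are illustrative, and the hyperparameters are the run_qa.py arguments visible in the mpirun command quoted further down.

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="run_qa.py",
    instance_type="ml.p3.16xlarge",   # matches SAGEMAKER_INSTANCE_TYPE in the logs
    instance_count=2,                 # the two instances whose logs are attached
    role=role,
    transformers_version="4.6",       # illustrative; the notebook pins its own
    pytorch_version="1.8",            # illustrative
    py_version="py36",                # the container runs /opt/conda/bin/python3.6
    # This is the switch that makes SageMaker launch training via
    # mpirun/smddprun, as in the command quoted further down:
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    hyperparameters={
        "model_name_or_path": "bert-large-uncased-whole-word-masking",
        "dataset_name": "squad",
        "do_train": True,
        "do_eval": True,
        "fp16": True,
        "max_seq_length": 384,
        "doc_stride": 128,
        "max_steps": 100,
        "num_train_epochs": 2,
        "output_dir": "/opt/ml/model",
    },
)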
I made a mistake on the second instance log; the full log is now attached here. instance2.txt
I notice that when the first instance connects to the second, its log message is:
Can connect to host algo-2 at port 22
vs the second instance:
Can connect to host algo-1
In the sagemaker-training-toolkit code these appear to be coming from different classes. The first one is a log message from smdataparallel.py, whereas the second is from mpi.py. But maybe that is expected.
Finally, here is the cmd that ran on one of the nodes (I assume the "master" node):
AlgorithmError: ExecuteUserScriptError: Command "mpirun --host algo-1:8,algo-2:8 -np 16 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_HOMOGENEOUS=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -x SMDATAPARALLEL_SERVER_ADDR=algo-1 -x SMDATAPARALLEL_SERVER_PORT=7592 -x SAGEMAKER_INSTANCE_TYPE=ml.p3.16xlarge smddprun /opt/conda/bin/python3.6 -m mpi4py run_qa.py --dataset_name squad --do_eval True --do_train True --doc_stride 128 --fp16 True --max_seq_length 384 --max_steps 100 --model_name_or_path bert-large-uncased-whole-word-masking --num_train_epochs 2 --output_dir /opt/ml/model --pad_to_max_length Tr
Did some more experimentation. By turning on debug logging, I was able to see the (JSON) request used to create the SageMaker training job. I took this JSON and edited the SageMaker image URL it used. The updated (and redacted) JSON is attached. TL;DR: training works with this image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04
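(For anyone reproducing this: one way to surface that request JSON is to raise botocore's log level before calling fit(). This is plain boto3 logging, nothing SageMaker-specific.)

import logging
import boto3

# Print every botocore request/response, including the CreateTrainingJob
# payload that the SageMaker SDK assembles from the estimator:
boto3.set_stream_logger("botocore", logging.DEBUG)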
HOWEVER, the HuggingFace estimator object in the SageMaker Python SDK will not "construct" this image URL from the Python version, PyTorch version, and huggingface version, because of the constraints on those values. The image_uri must be specified directly.
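In SDK terms, the workaround looks like this. A sketch: the URI is the us-east-1 training image above, and the transformers/pytorch/python version arguments are omitted because the SDK would reject this combination.

from sagemaker.huggingface import HuggingFace

image_uri = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:"
    "1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04"
)

# Passing image_uri directly bypasses the SDK's version-matrix validation:
huggingface_estimator = HuggingFace(
    entry_point="run_qa.py",
    instance_type="ml.p3.16xlarge",
    instance_count=2,
    role=role,
    image_uri=image_uri,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)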
Update: the trained model will NOT deploy to an inference endpoint. I am getting this error:
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
--
bash: no job control in this shell
/usr/local/bin/start_with_right_hostname.sh: line 10: serve: command not found
My guess is that this is because there isn't an inference image for that version of PyTorch.
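If that's the cause, one possible (untested) workaround would be the same trick as for training: wrap the trained artifacts in a HuggingFaceModel and pin an inference image explicitly. I have not verified that a huggingface-pytorch-inference image actually exists for PyTorch 1.13.1, so the URI below is a placeholder.

from sagemaker.huggingface import HuggingFaceModel

# Hypothetical placeholder -- would need to be a real inference image URI
# from the AWS deep-learning-containers list:
inference_image_uri = "<matching huggingface-pytorch-inference image>"

model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,  # S3 path to model.tar.gz
    role=role,
    image_uri=inference_image_uri,
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")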