course-v3
SageMaker CloudFormation stack won't start
My CloudFormation stack has not been starting up successfully since Sunday. When I try to start the SageMaker Notebook, it sits in Pending for a while then goes to Failed. The error message is:
Notebook Instance Lifecycle Config 'arn:aws:sagemaker:ap-southeast-2:xxxxxxx:notebook-instance-lifecycle-config/fastainblifecycleconfig-xxxxx' for Notebook Instance 'arn:aws:sagemaker:ap-southeast-2:xxxxxxx:notebook-instance/fastai' took longer than 5 minutes. Please check your CloudWatch logs for more details if your Notebook Instance has Internet access.
Looking at the logs, I just see these log lines:
Creating symlinks
Install a new kernel for fastai with name 'Python 3'
Installed kernelspec fastai in /home/ec2-user/.local/share/jupyter/kernels/fastai
Update fastai library
I've also tried to start up a couple of new stacks in a few different regions using the templates linked in the course here but I get the same error. I'm pretty new to fastai so I'm not really sure what the problem is, apologies if this is the wrong place to report the issue.
Having the exact same issue on us-east-1. It completely disables my ability to train models in a reasonable time. See below for a workaround.
AWS has been super helpful in diagnosing and solving so far.
I used fastai's start and create scripts from the notebook lifecycle configuration tab in the AWS console to get it working again.
Issue Cause
Getting past line 17 of the start script takes 12 minutes on a p3.xl, and the total start script took 17 minutes. This exceeds the 5 minute timeout.
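If you want to see which step eats the time on your own instance, something like the following could be pasted into a Jupyter terminal. It is only a rough sketch: `time_step` and `LIMIT_SECONDS` are names I made up, and the 300 second value simply mirrors the 5 minute limit from the error message.

```shell
#!/bin/bash
# Hypothetical helper: run one step and report how long it took.
# LIMIT_SECONDS mirrors the 5 minute limit quoted in the error message.
LIMIT_SECONDS=300

time_step() {
    local label="$1"
    shift
    local start end status elapsed
    start=$(date +%s)
    "$@"                      # run the step exactly as given
    status=$?
    end=$(date +%s)
    elapsed=$(( end - start ))
    echo "[$label] took ${elapsed}s (exit status $status)"
    if [ "$elapsed" -gt "$LIMIT_SECONDS" ]; then
        echo "[$label] alone exceeds the ${LIMIT_SECONDS}s limit"
    fi
    return "$status"
}

# Example (commented out because it needs conda on the instance):
# time_step "fastai update" conda install -y fastai -c fastai
```

Wrapping each command of the start script this way should show whether the conda install is the step that blows past the limit, as it was for me.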
Quick and dirty way to get it working again (this MIGHT not persist all the installs and MIGHT need to be done every time you start):
- Create a new, bare notebook instance with no start/create scripts. You can use the same permissions as before.
- Launch Jupyter Notebook.
- Open a terminal in Jupyter (under New), then paste the create script below. It's exactly the script from fastai's lifecycle config. The create script runs in a few seconds.
- After that, paste the first piece of the modified fastai start script below into the Jupyter terminal. It will take ~20 minutes; most of that time is Anaconda solving the environment. The last step kills the terminal.
- Refresh the page and the terminal should restart. Paste the next piece to finish the start script.
- Load whatever you normally would to get SageMaker working.
create script:
sudo -H -i -u ec2-user bash << EOF
# create symlinks to EBS volume
echo "Creating symlinks"
mkdir /home/ec2-user/SageMaker/.torch && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
mkdir /home/ec2-user/SageMaker/.fastai && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
# clone the course notebooks
echo "Clone the course repo"
git clone https://github.com/fastai/course-v3.git /home/ec2-user/SageMaker/course-v3
echo "Finished running onCreate script"
EOF
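One caveat with the `mkdir ... && ln -s ...` lines in the create script: they only work the first time, because `mkdir` fails once the folder exists. A re-runnable variant might look like the sketch below. `link_persistent_dir` is a hypothetical helper of mine, not part of fastai's lifecycle config.

```shell
#!/bin/bash
# Hypothetical helper: keep a dot-directory on the persistent volume and
# link it into the home directory. Safe to run more than once.
link_persistent_dir() {
    local name="$1"      # e.g. .torch
    local persist="$2"   # e.g. /home/ec2-user/SageMaker
    local home="$3"      # e.g. /home/ec2-user

    # -p makes mkdir succeed even if the folder already exists.
    mkdir -p "$persist/$name"

    # Only create the link if nothing is at that path yet.
    if [ ! -e "$home/$name" ] && [ ! -L "$home/$name" ]; then
        ln -s "$persist/$name" "$home/$name"
    fi
}

# Example (paths taken from the scripts in this thread):
# link_persistent_dir .torch  /home/ec2-user/SageMaker /home/ec2-user
# link_persistent_dir .fastai /home/ec2-user/SageMaker /home/ec2-user
```

Running it twice leaves a single symlink in place instead of erroring out or nesting a second link inside the folder.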
start script part 1
sudo -H -i -u ec2-user bash << EOF
echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
echo "Install a new kernel for fastai with name 'Python 3'"
source /home/ec2-user/anaconda3/bin/activate pytorch_p36
python -m ipykernel install --name 'fastai' --display-name 'Python 3' --user
# uncomment if you want to update PyTorch on every start
#echo "Update PyTorch library"
#conda install -y pytorch torchvision -c pytorch
echo "Update fastai library"
conda install -y fastai -c fastai
echo "Install jupyter nbextension"
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install jupyter_contrib_nbextensions
jupyter contrib nbextensions install --user
echo "Restarting jupyter notebook server"
# Kills the Jupyter terminal; refresh the page afterwards
pkill -f jupyter-notebook
EOF
start script part 2
echo "Getting latest version of fastai course"
cd /home/ec2-user/SageMaker/course-v3
git pull
echo "Finished running onStart script"
AWS said they'll send me scripts tonight that will persist the changes by making a new notebook instance and then creating a new conda environment stored in the /SageMaker/ folder. That way the changes persist, and you'd only need to run the scripts once on startup. It doesn't replace the convenience of the CloudFormation method, though.
@bonnici, @mattmcclean
Thanks so much for the workaround, I'll give that a go tonight and hopefully I can keep going on the course.
Steps for what appears to be a permanent fix, though not thoroughly tested.
I tested this by setting up and running lesson1-pets.ipynb up through the first .save().
- Create a new notebook instance. Choose the settings you need (GPU instance, 50GB storage, etc.). Once it is created, open Jupyter.
- Upload the shell script setupKernel.sh (below). Open a Jupyter terminal (New -> Terminal) and run this:
cd SageMaker
chmod +x ./setupKernel.sh
./setupKernel.sh
setupKernel.sh:
#!/bin/bash
echo "Creating new kernel"
conda create -y --prefix /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai python=3.6 ipykernel
source activate /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai
# Create the folders first so the symlinks below never point at a missing target
echo "Creating .fastai and .torch folders"
mkdir -p /home/ec2-user/SageMaker/.torch /home/ec2-user/SageMaker/.fastai
echo "Creating symlinks"
[ ! -L "/home/ec2-user/.torch" ] && ln -s /home/ec2-user/SageMaker/.torch /home/ec2-user/.torch
[ ! -L "/home/ec2-user/.fastai" ] && ln -s /home/ec2-user/SageMaker/.fastai /home/ec2-user/.fastai
echo "Update fastai library"
conda install -y fastai -c fastai
echo "Update torchvision library"
conda install -y pytorch-gpu torchvision -c anaconda
echo "Clone the course repo"
git clone https://github.com/fastai/course-v3.git /home/ec2-user/SageMaker/course-v3
echo "Finished running setupKernel.sh"
This takes roughly 15 minutes, maybe less.
- This should create a new kernel for you, called sm-fastai, with the necessary libraries installed (including fastai). It will also clone the fastai course repo from GitHub.
- Open one of the fastai example notebooks or create a new notebook. Under Kernel -> Change Kernel, you should be able to find conda_sm-fastai. Try using it to make sure everything works.
- Stop the notebook instance, create a new lifecycle configuration and add the following to its OnStart portion:
#!/bin/bash
set -e
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai
# Create symlink to kernel
ln -s /home/ec2-user/SageMaker/anaconda3/envs/sm-fastai /home/ec2-user/anaconda3/envs/sm-fastai
EOF
Make sure #!/bin/bash is on line 1. It cannot be on line 2 and it cannot have any spaces around it. If you copy and paste this on Windows, make sure you're using Unix-style line endings. In Notepad++, go to Edit -> EOL Conversion -> Unix to convert.
- Attach this lifecycle configuration to your notebook instance and start it. You should now see your kernel as normal going forward. Start times are much faster, and the configuration will not hit the 5 minute timeout, since all the work has already been done and is permanently stored under the /home/ec2-user/SageMaker/ directory.
For any brand new instances created after this, you need to follow these steps once per instance, but never again after that (you can just stop/start as normal).
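If you can't fix the line endings in an editor, the same cleanup can be done from a Jupyter terminal. This is a minimal sketch, assuming the script was uploaded with Windows (CRLF) line endings; `fix_script_format` is a hypothetical name, and it relies only on standard `tr` and `head`.

```shell
#!/bin/bash
# Hypothetical helper: strip Windows carriage returns from a script and
# confirm that the shebang sits on line 1 with nothing around it.
fix_script_format() {
    local file="$1"
    local tmp
    tmp=$(mktemp)

    # Drop every CR so CRLF line endings become plain LF.
    # Writing back with cat keeps the file's existing permissions.
    tr -d '\r' < "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"

    if [ "$(head -n 1 "$file")" = '#!/bin/bash' ]; then
        echo "ok: shebang is on line 1"
    else
        echo "warning: line 1 is not exactly #!/bin/bash"
        return 1
    fi
}

# Example:
# fix_script_format ./setupKernel.sh
```

The same check can be run on a local copy of the OnStart script before pasting it into the lifecycle configuration.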
Thanks - seems to be working. I did run into that line ending issue on step 1 but copying and pasting it into an editor in the terminal worked. I'm planning to run through the lesson 3 notebooks today so if those all work I think we're in business.
If it's all good I might see if I can copy all this stuff into a lifecycle configuration and make a new CloudFormation template.
Edit: Actually, I started having issues with not being able to select the kernel. What I ended up doing was commenting out the line:
conda install -y fastai -c fastai
in my original notebook's startup lifecycle configuration, then just running that in a new terminal after I started up the notebook. That way I also get to keep all my old data files etc.
sagemaker-cfn.yml needs to be updated. I created the following fix, but I am unable to push it:
--- a/docs/setup/sagemaker-cfn.yml
+++ b/docs/setup/sagemaker-cfn.yml
@@ -77,7 +77,7 @@ Resources:
 #conda install -y pytorch torchvision -c pytorch
 echo "Update fastai library"
-conda install -y fastai -c fastai
+nohup /home/ec2-user/anaconda3/bin/conda install -y fastai -c fastai -v &
 echo "Install jupyter nbextension"
 source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
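For anyone wondering what the `nohup ... &` change does: it starts the install in the background so the rest of the script can carry on without waiting for it. Below is a small sketch of the same pattern with the output sent to a log file so you can check on progress later. `run_detached` and the log path are my own illustrative names, and I have not verified how SageMaker treats background processes once the lifecycle script exits.

```shell
#!/bin/bash
# Hypothetical helper: start a long-running command in the background,
# immune to hangups, with all output captured in a log file.
run_detached() {
    local log="$1"
    shift
    nohup "$@" > "$log" 2>&1 &
    echo $!    # print the background process ID
}

# Example (commented out because it needs conda on the instance):
# run_detached /home/ec2-user/SageMaker/fastai-update.log \
#     /home/ec2-user/anaconda3/bin/conda install -y fastai -c fastai -v
# tail -f /home/ec2-user/SageMaker/fastai-update.log
```

Keep in mind that with this approach the notebook can come up before the install has finished, so the fastai update may not be in place for the first several minutes.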