dask-yarn

EMR bootstrap script fails

Open ResidentMario opened this issue 4 years ago • 18 comments

The EMR bootstrap script currently fails with the following error (found via stderr logs):

+ sudo mv /tmp/jupyter-notebook.conf /etc/init/
mv: cannot create regular file ‘/etc/init/’: Not a directory

ResidentMario, Jul 22 '20 17:07

Thank you for the error report, @ResidentMario. My apologies for the delayed response. The folks who maintain this repository have been busy lately.

Do you have any interest in submitting a patch to resolve this issue?

mrocklin, Aug 04 '20 02:08

I might be able to look into it, but no promises.

ResidentMario, Aug 04 '20 22:08

While debugging some other issues with this, I found that using the EMR release emr-5.29.0 instead of emr-5.30.1 resolves the problem. It looks like something in the new image is causing it. I thought that bit of intel might help.

nmerket, Aug 05 '20 18:08

Apparently, from emr-5.30 onwards, EMR only supports systemd and no longer supports upstart.
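
A bootstrap script that has to run on both older and newer EMR images could branch on the init system instead of assuming upstart. The following is a minimal sketch and not part of the dask-yarn script: `detect_init_system` is a hypothetical helper, and it assumes the usual conventions that systemd creates `/run/systemd/system` at boot and that upstart job files live in `/etc/init`.

```shell
#!/bin/sh
# Hypothetical helper (not from the dask-yarn bootstrap script).
# Prints "systemd", "upstart", or "unknown" for the filesystem rooted at $1
# (defaults to the real root when no argument is given).
detect_init_system() {
    root="${1:-}"
    if [ -d "$root/run/systemd/system" ]; then
        # systemd creates this directory early during boot
        echo "systemd"
    elif [ -d "$root/etc/init" ]; then
        # upstart reads its job definitions from /etc/init
        echo "upstart"
    else
        echo "unknown"
    fi
}

echo "Init system: $(detect_init_system)"
```

The systemd check comes first because some systemd hosts still ship an `/etc/init` directory for compatibility.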

hegde-anish, Aug 07 '20 18:08

I think it has to do with Amazon Linux 2:

Amazon Linux 2 support – In EMR version 5.30.0 and later, EMR uses Amazon Linux 2 OS. New custom AMIs (Amazon Machine Image) must be based on the Amazon Linux 2 AMI. For more information, see Using a Custom AMI.
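
Going by the note quoted above and the earlier comments in this thread, the release label alone may be enough to predict which kind of service file is needed. This is a rough sketch under that assumption: `emr_at_least_5_30` is a hypothetical helper, it only understands labels of the form `emr-MAJOR.MINOR.PATCH`, and the mapping from release to init system comes from this thread rather than from a check on the running host.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when an EMR release label is 5.30.0 or newer,
# the cutoff given in the Amazon Linux 2 note quoted in this thread.
emr_at_least_5_30() {
    version="${1#emr-}"      # "emr-5.30.1" -> "5.30.1"
    major="${version%%.*}"   # "5.30.1"     -> "5"
    rest="${version#*.}"     # "5.30.1"     -> "30.1"
    minor="${rest%%.*}"      # "30.1"       -> "30"
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 30 ]; }
}

for label in emr-5.29.0 emr-5.30.1; do
    if emr_at_least_5_30 "$label"; then
        echo "$label: expect systemd, install a .service unit"
    else
        echo "$label: expect upstart, install a .conf job in /etc/init"
    fi
done
```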

datafuz, Oct 01 '20 02:10

I tried to make it work with systemd and updated it with the following:

# -----------------------------------------------------------------------------
# 10. Configure Jupyter Notebook
# -----------------------------------------------------------------------------
echo "Configuring Jupyter"
mkdir -p $HOME/.jupyter
HASHED_PASSWORD=`python -c "from notebook.auth import passwd; print(passwd('$JUPYTER_PASSWORD'))"`
cat <<EOF >> $HOME/.jupyter/jupyter_notebook_config.py
c.NotebookApp.password = u'$HASHED_PASSWORD'
c.NotebookApp.open_browser = False
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.port = 8889
EOF

# # -----------------------------------------------------------------------------
# # 11. Define a systemd service for the Jupyter Notebook Server
# #
# # This sets the notebook server up to properly run as a background service.
# # -----------------------------------------------------------------------------
echo "Configuring Jupyter Notebook systemd Service"
cat <<EOF > /tmp/jupyter-notebook.service
[Unit]
Description=Jupyter Notebook

[Service]
ExecStart=$HOME/miniconda/bin/jupyter-notebook --config=$HOME/.jupyter/jupyter_notebook_config.py
Type=simple
PIDFile=/run/jupyter.pid
WorkingDirectory=$HOME
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
sudo mv /tmp/jupyter-notebook.service /etc/init.d/
sudo systemctl enable /etc/init.d/jupyter-notebook.service

# # -----------------------------------------------------------------------------
# # 12. Start the Jupyter Notebook Server
# # -----------------------------------------------------------------------------
# echo "Starting Jupyter Notebook Server"

sudo systemctl daemon-reload
sudo systemctl restart jupyter-notebook.service

Note: I added a port for the notebook.

The bootstrap completes, but nothing is listening on port 8889. When I run the ExecStart command manually over ssh, the notebook opens, so I'm not sure what I'm doing wrong. I also hit the following problem: https://github.com/dask/dask-yarn/issues/124

Sources for the new script:

  1. https://gist.github.com/klingtnet/76c542613e544a13bb7ad741b53f1f73
  2. https://medium.com/@joelzhang/setting-up-jupyter-notebook-server-as-service-in-ubuntu-16-04-116cf8e84781

EMR version: 5.31.0; Hadoop distribution: Amazon 2.10.0; Python: 3.7.9
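
When the bootstrap finishes but nothing answers on the notebook port, systemd's own status and journal output are usually the quickest way to see whether the unit was loaded, started, or crashed. This is a sketch rather than a definitive procedure: `diagnose_service` is a hypothetical wrapper, and it only calls standard `systemctl` and `journalctl` subcommands.

```shell
#!/bin/sh
# Hypothetical helper: collect systemd's view of a service.
diagnose_service() {
    unit="$1"
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found; this host does not appear to use systemd" >&2
        return 1
    fi
    # Is the unit registered to start at boot, and is it running right now?
    # Both commands exit non-zero for "disabled"/"inactive", so keep going.
    systemctl is-enabled "$unit" || true
    systemctl is-active "$unit" || true
    # Full status, including the most recent log lines
    systemctl status "$unit" --no-pager || true
    # Last 50 journal entries for the unit (may need sudo on some hosts)
    journalctl -u "$unit" -n 50 --no-pager
}

# Example, run on the EMR master node over ssh:
#   diagnose_service jupyter-notebook.service
```

If systemd reports that the unit cannot be found, the unit file is probably not in a directory that systemd searches.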

hamzahiqb, Oct 16 '20 20:10

Hi @hiqbal2, your script for systemd was super helpful. I got it to work by making a few changes to it:

  1. ExecStart=$HOME/miniconda/bin/jupyter-notebook --allow-root --config=$HOME/.jupyter/jupyter_notebook_config.py
  2. sudo mv /tmp/jupyter-notebook.service /etc/systemd/system/
  3. sudo systemctl enable jupyter-notebook.service

I hope this helps

hegde-anish, Oct 17 '20 05:10

@hegde-anish thanks for the help. EMR seems to bootstrap properly now. However, I'm not sure whether you also got this error when trying to start a dask cluster:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-762253d83df2> in <module>
      1 # Create a cluster
----> 2 cluster = YarnCluster()
      3 
      4 # Connect to the cluster
      5 client = Client(cluster)

/home/hadoop/miniconda/lib/python3.8/site-packages/dask_yarn/core.py in __init__(self, environment, n_workers, worker_vcores, worker_memory, worker_restarts, worker_env, scheduler_vcores, scheduler_memory, deploy_mode, name, queue, tags, user, host, port, dashboard_address, skein_client, asynchronous, loop)
    366         loop=None,
    367     ):
--> 368         spec = _make_specification(
    369             environment=environment,
    370             n_workers=n_workers,

/home/hadoop/miniconda/lib/python3.8/site-packages/dask_yarn/core.py in _make_specification(**kwargs)
    184             "See http://yarn.dask.org/environments.html for more information."
    185         )
--> 186         raise ValueError(msg)
    187 
    188     n_workers = lookup(kwargs, "n_workers", "yarn.worker.count")

ValueError: You must provide a path to a Python environment for the workers.
This may be one of the following:
- A conda environment archived with conda-pack
- A virtual environment archived with venv-pack
- A path to a conda environment, specified as conda://...
- A path to a virtual environment, specified as venv://...
- A path to a python binary to use, specified as python://...

See http://yarn.dask.org/environments.html for more information.

Not sure why this is happening.

I am also not sure whether calling $HOME/miniconda/bin/jupyter-notebook directly behaves differently from the original script's exec su - hadoop -c "jupyter notebook". When I try the old command, I get an error saying that hadoop -c does not exist.

I don't have any experience with hadoop or dask, so am a little lost on debugging this.
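
A later comment in this thread traces the ValueError to the notebook server running as root rather than as the hadoop user. One way to see why that matters: dask reads YAML configuration from the running user's `~/.config/dask` directory, so an environment configured under `/home/hadoop` is not visible to a process whose home is `/root`. The sketch below checks both locations. `has_yarn_environment` is a hypothetical helper, it assumes the bootstrap script wrote an `environment:` key into a YAML file in that directory, and it is a crude text match rather than a YAML parser.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the dask config directory under the home
# directory $1 contains a YAML file with an indented "environment:" key.
has_yarn_environment() {
    grep -qsE '^[[:space:]]+environment:' "$1"/.config/dask/*.yaml
}

for home_dir in /root /home/hadoop; do
    if has_yarn_environment "$home_dir"; then
        echo "$home_dir: environment is configured"
    else
        echo "$home_dir: no environment found, YarnCluster() needs environment=..."
    fi
done
```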

hamzahiqb, Oct 17 '20 11:10

This modified bootstrap script worked for me, with a few additional fixes:

  • conda pack failed with python=3.8.5 (see #133), so I specified a 3.7 version
  • My conda environment already contained tornado 6.1, which I found worked with jupyter-server-proxy 1.5.2 without issue (despite the comment in the script saying otherwise)
  • The AMI I used (EMR 5.32) contains aliases for python -> /usr/bin/python3 and pip -> /usr/bin/pip3 in /etc/bashrc (which gets imported into $HOME/.bashrc). This interferes with conda, since we want python -> ~/miniconda/bin/python
  • I also ran into the ValueError: You must provide a path to a Python environment for the workers issue that @hiqbal2 encountered. The root cause (no pun intended) is that the notebook server is running as root instead of the hadoop user.

To fix the latter two issues, I added unalias commands to ~/.bashrc before sourcing it, which feels like a bit of a hack:

# -----------------------------------------------------------------------------
# 2. Install Miniconda
# -----------------------------------------------------------------------------
echo "Installing Miniconda"
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p $HOME/miniconda
rm /tmp/miniconda.sh
echo -e 'unalias python || true' >> $HOME/.bashrc
echo -e 'unalias pip || true' >> $HOME/.bashrc
echo -e '\nexport PATH=$HOME/miniconda/bin:$PATH' >> $HOME/.bashrc
source $HOME/.bashrc
conda update conda -y

and I specified a User in the systemd [Service] section (which also let me remove the --allow-root flag that @hegde-anish suggested). I also had to export the JAVA_HOME environment variable:

# -----------------------------------------------------------------------------
# 11. Define a systemd service for the Jupyter Notebook Server
#
# This sets the notebook server up to properly run as a background service.
# -----------------------------------------------------------------------------
echo "Configuring Jupyter Notebook systemd Service"
cat <<EOF > /tmp/jupyter-notebook.service
[Unit]
Description=Jupyter Notebook

[Service]
User=hadoop
ExecStart=$HOME/miniconda/bin/jupyter-notebook --config=$HOME/.jupyter/jupyter_notebook_config.py
Environment=JAVA_HOME=$JAVA_HOME
Type=simple
PIDFile=/run/jupyter.pid
WorkingDirectory=$HOME
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
sudo mv /tmp/jupyter-notebook.service /etc/systemd/system/
sudo systemctl enable jupyter-notebook


# -----------------------------------------------------------------------------
# 12. Start the Jupyter Notebook Server
# -----------------------------------------------------------------------------
echo "Starting Jupyter Notebook Server"
sudo systemctl daemon-reload
sudo systemctl start jupyter-notebook

EMR version: 5.32.0; Hadoop distribution: Amazon 2.10.1; Python: 3.7.6
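
One detail of the script above is easy to miss: because the heredoc delimiter is unquoted, `$HOME` and `$JAVA_HOME` are expanded when the unit file is written, not when the service starts. If `JAVA_HOME` happens to be unset in the bootstrap shell, the unit ends up with an empty value. A minimal sketch of a guard follows. `render_unit` and its parameters are hypothetical, and the Java path in the example call is only a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print a unit file for the notebook server.
#   $1 = user to run as, $2 = that user's home directory, $3 = JAVA_HOME value
# PIDFile is left out here on the assumption that it is not needed for
# Type=simple, where systemd tracks the process it starts.
render_unit() {
    if [ -z "$3" ]; then
        echo "JAVA_HOME is empty; refusing to write the unit" >&2
        return 1
    fi
    cat <<EOF
[Unit]
Description=Jupyter Notebook

[Service]
User=$1
ExecStart=$2/miniconda/bin/jupyter-notebook --config=$2/.jupyter/jupyter_notebook_config.py
Environment=JAVA_HOME=$3
Type=simple
WorkingDirectory=$2
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
}

# Placeholder Java path, for illustration only.
render_unit hadoop /home/hadoop /path/to/java
```

In a bootstrap script the output would be redirected to a file such as /tmp/jupyter-notebook.service and then moved into /etc/systemd/system/, as in the script above.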

kqshan, Dec 18 '20 22:12

The above worked for me. However, the Jupyter notebook now does not output any values. I started the notebook over ssh and got the following error when evaluating a simple 2+2:

[E 12:12:00.355 NotebookApp] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fc061beb4d0>, <Future finished exception=TimeoutError('Timeout')>)
    Traceback (most recent call last):
      File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
        ret = callback()
      File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
        return fn(*args, **kwargs)
      File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/websocket.py", line 553, in <lambda>
        self.stream.io_loop.add_future(result, lambda f: f.result())
    tornado.util.TimeoutError: Timeout
ERROR:asyncio:Future exception was never retrieved
future: <Future finished exception=TimeoutError('Timeout')>
Traceback (most recent call last):
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/websocket.py", line 757, in _accept_connection
    yield open_result
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/websocket.py", line 553, in <lambda>
    self.stream.io_loop.add_future(result, lambda f: f.result())
tornado.util.TimeoutError: Timeout

hamzahiqb, Dec 29 '20 12:12

@kqshan this is great, thanks.

I didn't find that I needed to unalias; after the bootstrap I had proper pointers to miniconda python/pip. I'm running a newer EMR (emr-6.2.0), so this may be a factor.

I removed the version pin for tornado as well.

The conda pack issue appears to be from this conda issue. I added --ignore-missing-files and that resolved it, although I don't know whether I'll hit environment synchronization issues with my workers as a result (I haven't gotten that far in testing yet).

Also, the version spec for dask-yarn causes a file called "=0.7.0" to be written to the home folder. Some escaping or quoting is likely necessary to fix this, but I just removed the version specification because conda installed 0.8.1 on its own.
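
The stray file is consistent with how a shell parses an unquoted `>`: everything after it is taken as a redirection target. The sketch below reproduces the effect, using `echo` as a stand-in for `conda install` so that it has no side effects. That the bootstrap script contains exactly this unquoted spec is an assumption based on the file name reported above.

```shell
#!/bin/sh
demo_dir="$(mktemp -d)"

# Unquoted: the shell treats ">" as output redirection, so the command sees
# only "dask-yarn" and its output goes to a file named "=0.7.0".
( cd "$demo_dir" && echo conda install dask-yarn>=0.7.0 )

ls "$demo_dir"              # lists the stray file: =0.7.0
cat "$demo_dir/=0.7.0"      # prints: conda install dask-yarn

# Quoted: the whole version spec reaches the command as one argument.
quoted_output="$(cd "$demo_dir" && echo conda install "dask-yarn>=0.7.0")"
echo "$quoted_output"       # prints: conda install dask-yarn>=0.7.0
```

Quoting the spec, single or double, keeps the version constraint and avoids the extra file.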

davegravy, Jan 10 '21 23:01

I also ran into this issue when trying the bootstrap script. @davegravy did you test your script? Do you have a version that works with emr-6.2.0?

tjburrows, Mar 09 '21 22:03

What version of conda-pack is used? I believe 0.6 was released a month ago.

quasiben, Mar 09 '21 22:03

> I also ran into this issue when trying the bootstrap script. @davegravy did you test your script? Do you have a version that works with emr-6.2.0?

My script works with emr-6.2.0, yes. I've had no issues with dask & EMR, at least not that I can attribute to the bootstrap.

davegravy, Mar 11 '21 14:03

>> I also ran into this issue when trying the bootstrap script. @davegravy did you test your script? Do you have a version that works with emr-6.2.0?

> My script works with emr-6.2.0, yes. I've had no issues with dask & EMR, at least not that I can attribute to the bootstrap.

Can you share it?

tjburrows, Mar 11 '21 14:03

>>> I also ran into this issue when trying the bootstrap script. @davegravy did you test your script? Do you have a version that works with emr-6.2.0?

>> My script works with emr-6.2.0, yes. I've had no issues with dask & EMR, at least not that I can attribute to the bootstrap.

> Can you share it?

Sure:

https://gist.github.com/davegravy/61e3abb81176f4490032554b70d28c31

davegravy, Mar 11 '21 17:03

Hello, I tried installing dask with many versions of the bootstrap script and of EMR, but nothing works. If possible, please share which EMR version and dask bootstrap script you used. Thanks. @davegravy, in your bootstrap script, line 125 is censored: "Downloading pyquis step".

gabriel131188, Aug 13 '21 18:08

> Hello, I tried installing dask with many versions of the bootstrap script and of EMR, but nothing works. If possible, please share which EMR version and dask bootstrap script you used. Thanks.

Hi, I was using EMR 6.2.0.

> @davegravy, in your bootstrap script, line 125 is censored: "Downloading pyquis step".

This is a private Python library that my bootstrap script installs. It shouldn't have any bearing on the bootstrap's ability to succeed.

davegravy, Aug 17 '21 16:08