
CLI tool freezes when used with Self-Hosted

FyzHsn opened this issue 1 year ago · 3 comments

Environment

  • Covalent version:
  • Python version: 3.8.13
  • Operating system: macOS Ventura (Apple M1)

What is happening?

The Covalent CLI tool hangs unless the self-hosted dispatcher address and the local server are configured in a specific order: the server must be started before the self-hosted dispatcher address is set. Furthermore, when the hung CLI tool is interrupted, we get the following message in the logs:

Hosting the HTTP server on port 55939 instead
  warnings.warn(
2023-04-27 21:11:33,122 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-av4b15si', purging
2023-04-27 21:11:33,134 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-n3571p30', purging
2023-04-27 21:11:33,135 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-tg12mef1', purging
2023-04-27 21:11:33,137 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-pgk81_nz', purging
2023-04-27 21:11:33,138 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-afq5gr_e', purging
2023-04-27 21:11:33,138 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-swkt39i5', purging
2023-04-27 21:11:33,139 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-xbs4wihs', purging
2023-04-27 21:11:33,139 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-pcp8u84v', purging
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
2023-04-27 21:16:08,762 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,769 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,768 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,774 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,782 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,783 - distributed.nanny - ERROR - Worker process died unexpectedly
Exception in thread Nanny stop queue watch:
Traceback (most recent call last):
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/site-packages/distributed/nanny.py", line 884, in watch_stop_q
    child_stop_q.close()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/queues.py", line 137, in close
    self._reader.close()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor
2023-04-27 21:16:08,789 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,789 - distributed.nanny - ERROR - Worker process died unexpectedly
Exception in thread Nanny stop queue watch:
Traceback (most recent call last):
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/site-packages/distributed/nanny.py", line 884, in watch_stop_q
    child_stop_q.close()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/queues.py", line 137, in close
    self._reader.close()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor
Process LocalDaskCluster:
Traceback (most recent call last):
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
    util._exit_function()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

How can we reproduce the issue?

Set the dispatcher address to a self-hosted instance (while ensuring that the local server has not already been started):

import covalent as ct

dispatcher_address = "ec2-54-211-217-217.compute-1.amazonaws.com"
triggers_server_addr = "localhost:48008"  # defined but not used below
dispatcher_port = "48008"

ct.set_config("dispatcher.address", dispatcher_address)
ct.set_config("dispatcher.port", dispatcher_port)

Then starting the local Covalent server via covalent start or covalent start --triggers-only causes the CLI to hang.
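As a workaround, reversing the order avoids the hang in our testing: start the local server first, then point the config at the self-hosted dispatcher. A minimal sketch, reusing the address from the reproduction above:

```python
# Workaround sketch: start the local server BEFORE setting the
# self-hosted dispatcher address in the config.
#
# Step 1 (in a terminal): covalent start
# Step 2 (only after the server is up):
import covalent as ct

ct.set_config("dispatcher.address", "ec2-54-211-217-217.compute-1.amazonaws.com")
ct.set_config("dispatcher.port", "48008")
```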

What should happen?

Setting the dispatcher address and then starting the server should not cause the Covalent CLI to freeze, regardless of the order of the two steps.

Any suggestions?

No response

FyzHsn avatar Apr 27 '23 21:04 FyzHsn

Hi, I am facing the same issue with a local server. I wanted to set a specific IP address, as usual, so that the GUI is accessible from the network, but it freezes.

sandipde avatar May 18 '23 15:05 sandipde

Hi, I am facing the same issue with a local server. I wanted to set a specific IP address, as usual, so that the GUI is accessible from the network, but it freezes.

Hi @sandipde, you can try setting the address via the dispatcher_addr field when calling ct.dispatch. Here's the documentation for setting the dispatcher address without setting it in the config file.
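For example, a sketch assuming the dispatcher_addr keyword of ct.dispatch described in the docs (the electron and lattice definitions here are illustrative, not from the original report):

```python
import covalent as ct

@ct.electron
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)

# Pass the self-hosted address per dispatch instead of writing it to the
# config file, sidestepping the config/start ordering issue above.
dispatch_id = ct.dispatch(
    workflow,
    dispatcher_addr="ec2-54-211-217-217.compute-1.amazonaws.com:48008",
)(1, 2)
```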

FyzHsn avatar May 18 '23 15:05 FyzHsn

@sandipde if you're using a local server you may also want to set COVALENT_SERVER_IFACE_ANY=1 on that machine before starting the server. Otherwise it will only be exposed to the local loopback interface.
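That variable can be set with export COVALENT_SERVER_IFACE_ANY=1 in the shell before running covalent start, or injected when launching the server from Python. A sketch (the actual launch is commented out so the snippet stays side-effect-free; it would require covalent to be installed):

```python
import os
import subprocess

# Build an environment with the interface flag set; it must be in place
# before the server process starts, not after.
env = dict(os.environ, COVALENT_SERVER_IFACE_ANY="1")

# Launch the server with the flag applied:
# subprocess.run(["covalent", "start"], env=env)

print(env["COVALENT_SERVER_IFACE_ANY"])  # prints "1"
```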

wjcunningham7 avatar May 26 '23 11:05 wjcunningham7