CLI tool freezes when used with Self-Hosted
Environment
- Covalent version:
- Python version: 3.8.13
- Operating system: macOS Ventura (M1)
What is happening?
The Covalent CLI tool hangs when the local server is started after the self-hosted dispatcher address has been set. We find that the server needs to be started before the self-hosted dispatcher address is set. Furthermore, when the hung CLI tool is interrupted, we get the following messages in the logs:
Hosting the HTTP server on port 55939 instead
warnings.warn(
2023-04-27 21:11:33,122 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-av4b15si', purging
2023-04-27 21:11:33,134 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-n3571p30', purging
2023-04-27 21:11:33,135 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-tg12mef1', purging
2023-04-27 21:11:33,137 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-pgk81_nz', purging
2023-04-27 21:11:33,138 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-afq5gr_e', purging
2023-04-27 21:11:33,138 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-swkt39i5', purging
2023-04-27 21:11:33,139 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-xbs4wihs', purging
2023-04-27 21:11:33,139 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/65/q5vwfnjd4nbgdb735y8qt9yw0000gn/T/dask-worker-space/worker-pcp8u84v', purging
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
2023-04-27 21:16:08,762 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,769 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,768 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,774 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,782 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,783 - distributed.nanny - ERROR - Worker process died unexpectedly
Exception in thread Nanny stop queue watch:
Traceback (most recent call last):
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/site-packages/distributed/nanny.py", line 884, in watch_stop_q
child_stop_q.close()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/queues.py", line 137, in close
self._reader.close()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/connection.py", line 177, in close
self._close()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/connection.py", line 361, in _close
_close(self._handle)
OSError: [Errno 9] Bad file descriptor
2023-04-27 21:16:08,789 - distributed.nanny - ERROR - Worker process died unexpectedly
2023-04-27 21:16:08,789 - distributed.nanny - ERROR - Worker process died unexpectedly
Exception in thread Nanny stop queue watch:
Traceback (most recent call last):
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/site-packages/distributed/nanny.py", line 884, in watch_stop_q
child_stop_q.close()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/queues.py", line 137, in close
self._reader.close()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/connection.py", line 177, in close
self._close()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/connection.py", line 361, in _close
_close(self._handle)
OSError: [Errno 9] Bad file descriptor
Process LocalDaskCluster:
Traceback (most recent call last):
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
util._exit_function()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/util.py", line 357, in _exit_function
p.join()
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/Users/faiyaz/opt/anaconda3/envs/qa/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
How can we reproduce the issue?
Setting the dispatcher address to a self-hosted instance (while ensuring that the local server has not already been started) via:
import covalent as ct
dispatcher_address = "ec2-54-211-217-217.compute-1.amazonaws.com"
triggers_server_addr = "localhost:48008"
dispatcher_port = "48008"
ct.set_config("dispatcher.address", dispatcher_address)
ct.set_config("dispatcher.port", dispatcher_port)
and then restarting the local Covalent server via covalent start or covalent start --triggers-only leads the CLI to hang.
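When reproducing this, it may help to first check whether the configured dispatcher address is reachable from the local machine at all. The following is only a diagnostic sketch using the Python standard library; it is not part of Covalent, the helper name is_reachable is hypothetical, and an unreachable address has not been confirmed as the cause of the hang.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical diagnostic helper, not a Covalent API.
    """
    try:
        # create_connection raises OSError (including timeouts and DNS
        # failures) when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port taken from the reproduction steps above; the host is the local machine.
    print(is_reachable("localhost", 48008))
```

The same check can be pointed at the self-hosted dispatcher's hostname and port to see whether a connection attempt returns promptly or runs into the timeout.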
What should happen?
Setting the dispatcher address and then starting the server in the Covalent CLI should not lead to freezing.
Any suggestions?
No response
Hi, I am facing the same issue on a local server. I just wanted to set a specific IP address, as usual, so that the GUI is accessible from the network, but it freezes.
Hi @sandipde, you can try setting the address via the dispatcher_addr field when calling ct.dispatch. Here's the documentation for setting the dispatcher address without setting it in the config file.
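A minimal sketch of that suggestion, for illustration only. It assumes that ct.dispatch and ct.get_result both accept a dispatcher_addr keyword and that the address is given as a full URL including the scheme; please check both against the documentation for your Covalent version. The helpers build_dispatcher_addr and dispatch_to are hypothetical names, not part of Covalent.

```python
def build_dispatcher_addr(host: str, port: int, scheme: str = "http") -> str:
    """Return a URL such as 'http://host:48008' (hypothetical helper)."""
    return f"{scheme}://{host}:{port}"


def dispatch_to(workflow, addr: str, *args, **kwargs):
    """Dispatch `workflow` to the server at `addr` and wait for its result.

    Assumes ct.dispatch and ct.get_result accept `dispatcher_addr`.
    """
    # Imported here so the address helper can be used without Covalent installed.
    import covalent as ct

    dispatch_id = ct.dispatch(workflow, dispatcher_addr=addr)(*args, **kwargs)
    return ct.get_result(dispatch_id, wait=True, dispatcher_addr=addr)


# Host and port taken from the reproduction steps in this issue.
addr = build_dispatcher_addr("ec2-54-211-217-217.compute-1.amazonaws.com", 48008)
print(addr)
```

Passing the address per call this way leaves dispatcher.address and dispatcher.port in the local config untouched, so the local server can still be started with its default settings.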
@sandipde if you're using a local server, you may also want to set COVALENT_SERVER_IFACE_ANY=1 on that machine before starting the server. Otherwise it will only be exposed to the local loopback interface.