"Cannot start hyperg" while it's already running

Open etam opened this issue 4 years ago • 2 comments

Description

Golem Version: f5a985ec7d8456e80448b35fe9c31840136d329c

OS: Linux

Branch: b0.23

Reproducible: sometimes

Description of the issue:

When golem is started with hyperg already running, it usually detects the running daemon correctly, but sometimes it fails with:

2020-03-21 04:13:33 CRITICAL golem.client                        Can't start network. Giving up.
Traceback (most recent call last):
  File "/home/buildbot-worker/worker/test_node_integration/build/golem/client.py", line 373, in start
    self.start_network()
  File "/home/buildbot-worker/worker/test_node_integration/build/golem/client.py", line 480, in start_network
    self.daemon_manager.start()
  File "/home/buildbot-worker/worker/test_node_integration/build/golem/network/hyperdrive/daemon_manager.py", line 116, in start
    return self._start()
  File "/home/buildbot-worker/worker/test_node_integration/build/golem/report.py", line 173, in wrapper
    return func(*args, **kwargs)
  File "/home/buildbot-worker/worker/test_node_integration/build/golem/network/hyperdrive/daemon_manager.py", line 138, in _start
    raise RuntimeError("Cannot start {}".format(self._executable))
RuntimeError: Cannot start hyperg

Actual result:

Golem fails to start.

Steps To Reproduce

  1. Start hyperg
  2. Start golem

Expected behavior

Golem should always detect running hyperg.

Logs and any additional context

https://buildbot.golem.network/buildbot/#builders/15/builds/979 (test test_task_timeout)

https://buildbot.golem.network/buildbot/#/builders/15/builds/981 (test test_frame_restart)

etam, Mar 24 '20 09:03

Hypothesis: Before starting hyperg, golem tries to connect to a potentially existing one. This might be undefined behavior in twisted if the call is made from a thread.

etam, Mar 27 '20 09:03

AFAIR the node_integration_tests are responsible for starting their own hyperg. I think this can be related to:

  • not properly closing hyperg after test1, making test2 fail (zombie-g)
  • race when starting the hyperg on the same machine from multiple nodes at the same time

maaktweluit, Mar 27 '20 11:03
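
For anyone trying to narrow this down: if the failure really is a startup race or a daemon from a previous test that is still shutting down, a detection probe that retries a few times before giving up may be more robust than a single attempt. Below is a minimal sketch using only the standard library. It is not golem's actual detection code; the function name `is_daemon_listening`, its parameters, and the assumption that the daemon can be detected by a plain TCP connect are all hypothetical.

```python
import socket
import time


def is_daemon_listening(host, port, attempts=3, delay=0.2, timeout=1.0):
    """Return True if something accepts TCP connections on (host, port).

    Hypothetical helper, not part of golem. It retries a few times so
    that a daemon which is still starting up is not immediately treated
    as missing.
    """
    for attempt in range(attempts):
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError)
            # when nothing is listening or the connect times out.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

A caller would only try to spawn a new daemon when this returns False, and could run the same probe again after a failed spawn before raising, in case another node won the race in the meantime.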