Getting online domains using peers health check
Description
Now, we are running peers health check periodically in background task (this PR).
This PR tries to get the list of online domains using the results of this peers health check feature instead of trying to login the domains as guest clients to check if they are online every time we call sy.domains
Check out this pull request on ![]()
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
NetworkRegistry._repr_html_ currently looks like
Problem: Peer health check not working with webserver nodes
and still works with python node
address this comment https://github.com/OpenMined/PySyft/pull/8679#discussion_r1590625144 to fix this issue
Problem: Registry.check_network not working with in-memory webserver gateways (checking network without frontend not working?)
@property
def online_networks(self) -> list[dict]:
networks = self.all_networks
def check_network(network: dict) -> dict[Any, Any] | None:
url = "http://" + network["host_or_ip"] + ":" + str(network["port"]) + "/"
try:
res = requests.get(url, timeout=DEFAULT_TIMEOUT) # nosec
online = "This is a PyGrid Network node." in res.text
except Exception:
online = False
# networks without frontend have a /ping route in 0.7.0
if not online:
try:
ping_url = url + "ping"
res = requests.get(ping_url, timeout=DEFAULT_TIMEOUT) # nosec
online = res.status_code == 200
except Exception:
online = False
if online:
version = network.get("version", None)
# Check if syft version was described in NetworkRegistry
# If it's unknown, try to update it to an available version.
if not version or version == "unknown":
# If not defined, try to ask in /syft/version endpoint (supported by 0.7.0)
try:
version_url = url + "api/v2/metadata"
res = requests.get(version_url, timeout=DEFAULT_TIMEOUT) # nosec
if res.status_code == 200:
network["version"] = res.json()["syft_version"]
else:
network["version"] = "unknown"
except Exception:
network["version"] = "unknown"
return network
return None
TODO:
- [x] Run again
tests/integration/local/gateway_local_test.pyin CI - [x] Make peer heath check works with k8s nodes
New repr does not show ping_status.value
Revert it back to just ping_status
Trying to test with 2 k8s domains failed
Reproduce: First, launch a gateway and domain1 with
CLUSTER_NAME=testgateway1 CLUSTER_HTTP_PORT=9081 tox -e dev.k8s.launch.gateway
CLUSTER_NAME=testdomain1 CLUSTER_HTTP_PORT=9082 tox -e dev.k8s.launch.domain
which works. Then trying to launch a second domain with CLUSTER_NAME=testdomain2 CLUSTER_HTTP_PORT=9083 tox -e dev.k8s.launch.domain which also show that it's successful
Login to the gateway and first domain work
However, failed to login the second domain
Note on some flaky integration k8s network tests after merging the Peers Health Check PR:
Since we now check the peers' health by making a connection and create a guest client on the peer node (using peer_client = peer.client_with_context(context=context) in PeerHealthCheckTask.peer_route_healthcheck which is run in a background thread, the tests in tests/integration/network/gateway_test.py that add and update priority of routes with invalid ports may result in error Failed to establish a connection with. However, this error will not be seen in the main thread that run the test. Furthermore, the flakiness of these tests is because we only run peer_route_healthcheck periodically, so they will only fail when they run at the moment when the invalid routes have the highest priority.
Another issue was that we updated the whole peer object in PeerHealthCheckTask.peer_route_heathcheck which can update the outdated node_routes to the node's store in the background thread. This is fixed in this PR https://github.com/OpenMined/PySyft/pull/8851.
TODO: checking route's validity (e.g. by pinging) before adding it to a list of node routes?
Moved the PR here https://github.com/OpenMined/PySyft/pull/8851 with some bug fixes and improvements