PySyft icon indicating copy to clipboard operation
PySyft copied to clipboard

Getting online domains using peers health check

Open khoaguin opened this issue 1 year ago • 8 comments

Description

Now, we are running peers health check periodically in background task (this PR).
This PR tries to get the list of online domains using the results of this peers health check feature instead of trying to login the domains as guest clients to check if they are online every time we call sy.domains

khoaguin avatar May 09 '24 04:05 khoaguin

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

NetworkRegistry._repr_html_ currently looks like

image

khoaguin avatar May 10 '24 05:05 khoaguin

Problem: Peer health check not working with webserver nodes

image and still works with python node image

address this comment https://github.com/OpenMined/PySyft/pull/8679#discussion_r1590625144 to fix this issue

khoaguin avatar May 10 '24 05:05 khoaguin

Problem: Registry.check_network not working with in-memory webserver gateways (checking network without frontend not working?)

    @property
    def online_networks(self) -> list[dict]:
        networks = self.all_networks

        def check_network(network: dict) -> dict[Any, Any] | None:
            url = "http://" + network["host_or_ip"] + ":" + str(network["port"]) + "/"
            try:
                res = requests.get(url, timeout=DEFAULT_TIMEOUT)  # nosec
                online = "This is a PyGrid Network node." in res.text
            except Exception:
                online = False

            # networks without frontend have a /ping route in 0.7.0
            if not online:
                try:
                    ping_url = url + "ping"
                    res = requests.get(ping_url, timeout=DEFAULT_TIMEOUT)  # nosec
                    online = res.status_code == 200
                except Exception:
                    online = False

            if online:
                version = network.get("version", None)
                # Check if syft version was described in NetworkRegistry
                # If it's unknown, try to update it to an available version.
                if not version or version == "unknown":
                    # If not defined, try to ask in /syft/version endpoint (supported by 0.7.0)
                    try:
                        version_url = url + "api/v2/metadata"
                        res = requests.get(version_url, timeout=DEFAULT_TIMEOUT)  # nosec
                        if res.status_code == 200:
                            network["version"] = res.json()["syft_version"]
                        else:
                            network["version"] = "unknown"
                    except Exception:
                        network["version"] = "unknown"
                return network
            return None

khoaguin avatar May 10 '24 05:05 khoaguin

TODO:

  • [x] Run again tests/integration/local/gateway_local_test.py in CI
  • [x] Make peer heath check works with k8s nodes

khoaguin avatar May 10 '24 10:05 khoaguin

New repr does not show ping_status.value image

Revert it back to just ping_status image

khoaguin avatar May 14 '24 02:05 khoaguin

Trying to test with 2 k8s domains failed

Reproduce: First, launch a gateway and domain1 with

CLUSTER_NAME=testgateway1 CLUSTER_HTTP_PORT=9081 tox -e dev.k8s.launch.gateway
CLUSTER_NAME=testdomain1 CLUSTER_HTTP_PORT=9082 tox -e dev.k8s.launch.domain

which works. Then trying to launch a second domain with CLUSTER_NAME=testdomain2 CLUSTER_HTTP_PORT=9083 tox -e dev.k8s.launch.domain which also show that it's successful image

Login to the gateway and first domain work image

However, failed to login the second domain image

khoaguin avatar May 14 '24 03:05 khoaguin

Note on some flaky integration k8s network tests after merging the Peers Health Check PR:

Since we now check the peers' health by making a connection and create a guest client on the peer node (using peer_client = peer.client_with_context(context=context) in PeerHealthCheckTask.peer_route_healthcheck which is run in a background thread, the tests in tests/integration/network/gateway_test.py that add and update priority of routes with invalid ports may result in error Failed to establish a connection with. However, this error will not be seen in the main thread that run the test. Furthermore, the flakiness of these tests is because we only run peer_route_healthcheck periodically, so they will only fail when they run at the moment when the invalid routes have the highest priority.

Another issue was that we updated the whole peer object in PeerHealthCheckTask.peer_route_heathcheck which can update the outdated node_routes to the node's store in the background thread. This is fixed in this PR https://github.com/OpenMined/PySyft/pull/8851.

TODO: checking route's validity (e.g. by pinging) before adding it to a list of node routes?

khoaguin avatar May 15 '24 02:05 khoaguin

Screenshot from 2024-05-27 13-30-55

khoaguin avatar May 27 '24 06:05 khoaguin

Moved the PR here https://github.com/OpenMined/PySyft/pull/8851 with some bug fixes and improvements

khoaguin avatar May 27 '24 06:05 khoaguin