
Running custom remote executors

Open N520 opened this issue 3 years ago • 6 comments

Hi,

I have been trying to get some remote executors running using this guide, but I seem to be stuck: I was not able to obtain the required remote executor service Docker image, despite following the guide to the letter.

Having failed at that, I started experimenting with executor_factories and servers:

import tensorflow_federated as tff

executor_factory = tff.framework.local_executor_factory(default_num_clients=1, max_fanout=100)
tff.simulation.run_server(executor_factory, 10, 8081, None)

This allows me to reach the executor in the training loop with the following setup, although I keep getting gRPC INVALID_ARGUMENT exceptions and no result.

import grpc
import tensorflow as tf
import tensorflow_federated as tff

ip_address = "172.19.0.2"  # IP address of the worker; cannot use the container name due to gRPC
port = 8081
GRPC_OPTIONS = [('grpc.max_message_length', 20 * 1024 * 1024),
                ('grpc.max_receive_message_length', 20 * 1024 * 1024)]
channels = [grpc.insecure_channel(f"{ip_address}:{port}", options=GRPC_OPTIONS)]
tff.backends.native.set_remote_execution_context(channels)

So, what I am asking is: is it possible to have a "local" executor in Python like this, and, furthermore, what is going on with the remote executor image referenced in the TensorFlow Federated docs? Any input is appreciated.

N520 avatar Oct 13 '21 13:10 N520

Any detailed error messages about "not able to obtain the required remote executor service docker image"?

This might be related to https://github.com/tensorflow/federated/issues/1958. Which version of TFF are you running?

wennanzhu avatar Oct 14 '21 19:10 wennanzhu

On the more-manual approach: your client / server code looks fine to me--I'd love to see your stacktraces for the INVALID_ARGUMENT error.

On the docker image: I believe the tutorial and the image itself were written a fairly long time ago, and looking at our GCP page it's not clear to me that the image still exists. I'll ask around in the project, but perhaps @michaelreneer knows whether our server image is still around?

jkr26 avatar Oct 14 '21 20:10 jkr26

@wennanzhu I am running the current TFF version, 0.19.0. As for the Docker image, I get an error saying the image was not found:

$ docker run --rm -p 8000:8000 gcr.io/tensorflow-federated/remote-executor-service:latest

Unable to find image 'gcr.io/tensorflow-federated/remote-executor-service:latest' locally
docker: Error response from daemon: manifest for gcr.io/tensorflow-federated/remote-executor-service:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/tensorflow-federated/remote-executor-service/manifests/latest".

I am able to pull the gcr.io/google-samples/hello-app:1.0 image though, meaning I can access the registry.

@jkr26 After integrating the client data in my use case I am now unable to reproduce the initial error (which is progress, I guess?), yet my server now claims it cannot connect to an SQLite database, which makes no sense to me:

2021-10-20 10:11:10.686876: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-10-20 10:11:10.686953: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-10-20 10:11:14.671447: W tensorflow/core/kernels/data/experimental/sql_dataset_op.cc:209] Failed to connect to database: Invalid argument: Sqlite::Open(/root/.tff/emnist_all.sqlite) failed: unable to open database file

I guess the CUDA errors can be ignored since I am not using a GPU. The code posted in the initial question has not been changed.

Any help is deeply appreciated.

N520 avatar Oct 20 '21 10:10 N520

As far as we can tell, our Docker image seems to be gone--apologies. Filing a bug to create a new release and ensure the image is pushed.

On the inability to read the dataset--TFF's hosted datasets are represented externally as SQLite databases, which are read via tf.data's SqlDataset op. I'm not sure we've seen this fail before.
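For what it's worth, the "unable to open database file" message in the log above is SQLite's standard complaint when the file (or its parent directory) does not exist or is unreadable; it can be reproduced with Python's stdlib sqlite3 alone, with no TFF involved:

```python
# Reproducing SQLite's "unable to open database file" with the stdlib, to show
# it is a filesystem/path problem rather than a TFF-specific failure.
import sqlite3

try:
    # The parent directory does not exist, so SQLite can neither open nor
    # create the database file at this path.
    sqlite3.connect("/nonexistent-dir/emnist_all.sqlite")
except sqlite3.OperationalError as e:
    print(e)  # -> unable to open database file
```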

One thing I wonder--perhaps this file doesn't exist? IIRC, a Python load_data call is responsible for downloading the appropriate database file. It is possible that in our external remote runtime, this call is happening on your controller, effectively downloading and caching that file locally there, but the same filename does not exist on your remote machine.
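The download-and-cache behavior described above can be sketched generically. This is only the pattern, not TFF's actual implementation; `fetch` is a hypothetical stand-in for whatever downloads the file:

```python
import os

def cached_path(filename, fetch, cache_dir):
    """Return a local path for filename, calling fetch(path) only on a cache miss.

    On the controller, the first call downloads the file into cache_dir. A
    worker process that never runs this (or that uses a different cache_dir)
    simply won't have the file at the expected path.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        fetch(path)  # cache miss: the download happens on THIS machine only
    return path
```

Running the appropriate load_data call on the worker amounts to forcing this cache miss there as well.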

You might be able to work around this (in a very ugly way) by calling the appropriate load_data function on your worker before it starts serving. We'll have to think about the best way to handle this in the OSS world. Maybe @ZacharyGarrett has ideas?

jkr26 avatar Oct 21 '21 18:10 jkr26

@jkr26 Okay that makes sense and solves my problem on the client-side.

As you said, I need to manually load the data on the workers again, as seen below:

    _, _ = tff.simulation.datasets.emnist.load_data()
    executor_factory = tff.framework.local_executor_factory(default_num_clients=1, max_fanout=100)
    tff.simulation.run_server(executor_factory, 10, 8081, None)

The first line loads the necessary emnist_all.sqlite file into the /root/.tff directory, which is the same path on my "server", and everything works like a charm.

This might go a little off-topic now, but how am I supposed to deal with datasets that are not provided by TFF? Do I have to provide my own implementation of ClientData? And am I assuming correctly that the paths of the files used by the client and server then have to be identical (e.g. /root/.tff) for the worker to load them?
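On the last point, one way to guard against path mismatches is to fail fast on the worker before it starts serving. This is a hypothetical helper, not a TFF API; the `~/.tff` default merely mirrors where the hosted datasets happen to cache:

```python
import os

def assert_dataset_present(filename, cache_dir=None):
    """Fail fast if the expected dataset file is missing on this machine."""
    cache_dir = cache_dir or os.path.join(os.path.expanduser("~"), ".tff")
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"{path} not found; run the matching load_data() (or copy the "
            "file here) before starting the executor service.")
    return path
```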

N520 avatar Oct 27 '21 12:10 N520

@jkr26 After resolving the current issue, were you able to resolve the AttributeError: 'collections.OrderedDict' object has no attribute 'loss' error?

shubham-ai avatar Jan 11 '22 15:01 shubham-ai