Running custom remote executors
Hi,
I have been trying to get some remote executors running using this guide and I seem to be stuck: I was not able to obtain the required remote executor service Docker image, all while following the mentioned guide to the letter.
Having failed at that, I started experimenting with executor_factories and servers:
import tensorflow_federated as tff
executor_factory = tff.framework.local_executor_factory(default_num_clients=1, max_fanout=100)
tff.simulation.run_server(executor_factory, 10, 8081, None)
This allows me to reach the executor in the training loop with the following setup, although I keep getting gRPC INVALID_ARGUMENT exceptions and no result.
import grpc
import tensorflow as tf
import tensorflow_federated as tff
ip_address = "172.19.0.2"  # IP address of the worker; cannot use the container name due to gRPC
port = 8081
GRPC_OPTIONS = [('grpc.max_message_length', 20 * 1024 * 1024),
('grpc.max_receive_message_length', 20 * 1024 * 1024)]
channels = [grpc.insecure_channel(f"{ip_address}:{port}", options=GRPC_OPTIONS)]
tff.backends.native.set_remote_execution_context(channels)
So, what I am asking is: is it possible to have a "local" executor in Python like this, and, furthermore, what is going on with the remote executor image referenced in the TensorFlow Federated docs? Any input is appreciated.
Any detailed error messages about "not able to obtain the required remote executor service docker image"?
This might be related to https://github.com/tensorflow/federated/issues/1958. Which version of TFF are you running?
On the more-manual approach: your client/server code looks fine to me--I'd love to see your stack traces for the INVALID_ARGUMENT error.
On the docker image: I believe the tutorial and the image itself were written a fairly long time ago, and looking at our GCP page it's not clear to me that the image still exists. I'll ask around in the project, but perhaps @michaelreneer knows whether our server image is still around?
@wennanzhu I am running the current TFF version, 0.19.0. In terms of the Docker image, I get an error saying the image was not found:
$ docker run --rm -p 8000:8000 gcr.io/tensorflow-federated/remote-executor-service:latest
Unable to find image 'gcr.io/tensorflow-federated/remote-executor-service:latest' locally
docker: Error response from daemon: manifest for gcr.io/tensorflow-federated/remote-executor-service:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/tensorflow-federated/remote-executor-service/manifests/latest".
I am able to pull the gcr.io/google-samples/hello-app:1.0 image though, meaning I can access the registry.
@jkr26 After integrating the client data in my use case I am now unable to reproduce the initial error (which is progress, I guess?), yet my server now claims it cannot connect to an SQLite database, which makes no sense to me:
2021-10-20 10:11:10.686876: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-10-20 10:11:10.686953: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-10-20 10:11:14.671447: W tensorflow/core/kernels/data/experimental/sql_dataset_op.cc:209] Failed to connect to database: Invalid argument: Sqlite::Open(/root/.tff/emnist_all.sqlite) failed: unable to open database file
I guess the CUDA errors can be ignored since I am not using a GPU. The code posted in the initial question has not been changed.
Any help is deeply appreciated.
As far as we can tell, our Docker image seems to be gone--apologies. Filing a bug to create a new release and ensure the image is pushed.
On the inability to read the dataset--TFF's hosted datasets are represented externally as SQLite databases, which are read via tf.data's SQLDataset op. I'm not sure we've seen this fail before.
One thing I wonder--perhaps this file doesn't exist? IIRC, a Python load_data call is responsible for downloading the appropriate database file. It is possible that in our external remote runtime, this call is happening on your controller, effectively downloading and caching that file locally there, but the same filename does not exist on your remote machine.
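The missing-file theory is easy to check, since SQLite reports exactly this "unable to open database file" message whenever the path can't be opened. A minimal stand-alone reproduction using only the standard library (the /root/.tff path is copied from the log above, purely as the path one would check on the worker):

```python
import os
import sqlite3
import tempfile

# Opening a database file whose parent directory does not exist fails with
# the same message seen in the worker log: "unable to open database file".
missing = os.path.join(tempfile.gettempdir(), "no_such_dir_xyz", "emnist_all.sqlite")
try:
    sqlite3.connect(missing)
except sqlite3.OperationalError as e:
    print(e)  # unable to open database file

# A quick pre-flight check to run on the worker before it starts serving;
# this path is taken from the error message above.
db_path = "/root/.tff/emnist_all.sqlite"
print(os.path.exists(db_path))
```

If the second check prints False on the worker, the database file simply was never downloaded there.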
You might be able to work around this (in a very ugly way) by calling the appropriate load_data function on your worker before it starts serving. We'll have to think about the best way to handle this in the OSS world. Maybe @ZacharyGarrett has ideas?
@jkr26 Okay, that makes sense and solves my problem on the client side. As you said, I need to manually load the data at the clients again, as seen below:
_, _ = tff.simulation.datasets.emnist.load_data()
executor_factory = tff.framework.local_executor_factory(default_num_clients=1, max_fanout=100)
tff.simulation.run_server(executor_factory, 10, 8081, None)
The first line loads the necessary emnist_all.sqlite file into the /root/.tff directory, which is the same on my "server", and everything works like a charm.
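For what it's worth, the path in question is just the dataset filename under a .tff directory in the user's home. A tiny helper (the name cache_path and its fixed home argument are mine, not TFF's API; TFF's actual cache logic may differ in detail) makes the client/server path-matching requirement concrete:

```python
import os

# Hypothetical helper illustrating where the cached dataset file is expected
# to live when the process runs as root. Not part of TFF's public API.
def cache_path(filename: str, home: str = "/root") -> str:
    return os.path.join(home, ".tff", filename)

print(cache_path("emnist_all.sqlite"))  # /root/.tff/emnist_all.sqlite
```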
This might go a little off-topic now, but how am I supposed to deal with datasets that are not provided? Do I have to provide my own implementation of ClientData? And am I assuming correctly that the paths of the files used by the client and server then have to be the same (e.g. /root/.tff) for the client to load it?
@jkr26 After resolving the current issue, were you able to resolve the AttributeError: 'collections.OrderedDict' object has no attribute 'loss' error?