database is locked

Open aixiangwang opened this issue 2 years ago • 5 comments

When I try to increase train_clients_per_round, I get an error saying that the database is locked. I then individually tested each client named in the errors and found that they were all accessible.

aixiangwang avatar Jun 25 '22 07:06 aixiangwang

Hi @aixiangwang. Can you provide the information requested on the new bug template, including:

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Python package versions (e.g., TensorFlow Federated, TensorFlow)
  • A minimal reproduction of the bug.

This looks like some edge case around the SQL-backed datasets TFF provides, but without the information above I'm not certain what's actually going on.
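
For context, the message "database is locked" is SQLite's standard error when one connection cannot obtain a lock that another connection holds. The sketch below reproduces that message with Python's built-in `sqlite3` module only. It does not use TFF and is not a reproduction of this issue; the file name `clients.sqlite` and the table layout are made up for illustration.

```python
import os
import sqlite3
import tempfile


def demonstrate_locked_database() -> str:
    """Returns the error message a second connection sees while the first holds a lock."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical file name; any SQLite file behaves the same way.
        db_path = os.path.join(tmp_dir, "clients.sqlite")

        # isolation_level=None puts the connection in autocommit mode, so the
        # transaction below is controlled explicitly.
        writer = sqlite3.connect(db_path, isolation_level=None)
        writer.execute("CREATE TABLE examples (client_id TEXT, payload BLOB)")
        # Take an exclusive lock and keep it open.
        writer.execute("BEGIN EXCLUSIVE")

        # timeout=0 makes the second connection fail at once instead of waiting.
        reader = sqlite3.connect(db_path, timeout=0)
        try:
            reader.execute("SELECT COUNT(*) FROM examples").fetchall()
            message = "no error"
        except sqlite3.OperationalError as err:
            message = str(err)
        finally:
            reader.close()
            writer.execute("ROLLBACK")
            writer.close()
        return message


if __name__ == "__main__":
    print(demonstrate_locked_database())
```

If the failure in this issue has the same origin, something else was holding a lock on the dataset file when a client dataset was read, which would explain why the same client can succeed in one round and fail in another.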

zcharles8 avatar Jun 27 '22 16:06 zcharles8

Thank you for your reply. I am using TensorFlow 2.8 and TensorFlow Federated 0.21 on Windows 10. [image] [image] As you can see, client_id = 'f3928_39', which is named in the error, is among the latest 10 randomly selected clients, and it triggers the error. But client_id = 'f3928_39' had already appeared several times before, so I'm very confused. Looking forward to your reply!

aixiangwang avatar Jun 28 '22 02:06 aixiangwang

Could you add the full code that actually causes this bug? Even better, if you can narrow it down to a smaller reproduction of it, that'd be really helpful.

zcharles8 avatar Jun 28 '22 15:06 zcharles8

The complete code is the simple_fedavg example under /tensorflow_federated/ in the current GitHub project. I only increased the train_clients_per_round parameter in emnist_fedavg_main.py from 2 to 10; it sets the number of clients sampled per round.

Some of the key code in the example is shown below:

    train_data, test_data = get_emnist_dataset()

    def tff_model_fn():
      """Constructs a fully initialized model for use in federated averaging."""
      keras_model = create_original_fedavg_cnn_model(only_digits=True)
      loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
      metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]
      return tff.learning.from_keras_model(
          keras_model,
          loss=loss,
          metrics=metrics,
          input_spec=train_data.element_type_structure)

    iterative_process = simple_fedavg_tff.build_federated_averaging_process(
        tff_model_fn, server_optimizer_fn, client_optimizer_fn)
    server_state = iterative_process.initialize()
    keras_model = create_original_fedavg_cnn_model(only_digits=True)

    for round_num in range(FLAGS.total_rounds):
      sampled_clients = np.random.choice(
          train_data.client_ids,
          size=FLAGS.train_clients_per_round,
          replace=False)
      print(sampled_clients)
      sampled_train_data = [
          train_data.create_tf_dataset_for_client(client)
          for client in sampled_clients
      ]
      server_state, train_metrics = iterative_process.next(
          server_state, sampled_train_data)
      print(f'Round {round_num}')
      print(f'\tTraining metrics: {train_metrics}')
      if round_num % FLAGS.rounds_per_eval == 0:
        server_state.model.assign_weights_to(keras_model)
        accuracy = evaluate(keras_model, test_data)
        print(f'\tValidation accuracy: {accuracy * 100.0:.2f}%')

As the first screenshot shows, the dataset loaded successfully and training ran for several rounds. [image] In a later round, however, an error occurs: the database is locked, and the next step of the federated process fails. [image] [image] Looking forward to your reply!

aixiangwang avatar Aug 01 '22 08:08 aixiangwang

@aixiangwang I have not been able to reproduce this issue, and we have seen no other reports of it.

I suspect this is related to the environment you are executing in. A similar error was reported in https://github.com/tensorflow/federated/issues/3479; there the cause was that the user had cached the dataset in a locked directory.
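
One quick way to rule out a locked cache directory is to check that the process can create and write a file there. This is a minimal sketch using only the standard library; the function name `can_write_to` is hypothetical, and which directory the dataset is actually cached in on a given machine is an assumption that has to be checked separately.

```python
import os
import tempfile


def can_write_to(directory: str) -> bool:
    """Returns True if a file can be created and written inside `directory`."""
    try:
        # Create the directory if it is missing, then write a throwaway file.
        os.makedirs(directory, exist_ok=True)
        with tempfile.TemporaryFile(dir=directory) as probe:
            probe.write(b"probe")
        return True
    except OSError:
        # Covers permission errors, read-only locations, and invalid paths.
        return False


if __name__ == "__main__":
    # Placeholder path: replace with the directory the dataset was cached to.
    print(can_write_to(os.path.join(tempfile.gettempdir(), "dataset_cache_probe")))
```

A result of False points at the environment (permissions, a read-only or synced folder, another process holding the directory) rather than at the training loop.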

zcharles8 avatar Mar 16 '23 15:03 zcharles8