federated
Stuck at learning_process.initialize() in DP Tutorial
Describe the bug
In the Colab notebook for DP, everything went well until I reached a code block that kept running for hours without producing any log output. Further debugging shows that the program never finishes executing the line state = learning_process.initialize().
Environment: The experiment is conducted from scratch using today's TFF (0.51.0). No modification has been made to any part of the notebook.
Expected behavior: The mentioned line should complete in an acceptable time, at most a few minutes.
Is this a duplicate of #3742?
Please try the new 0.52.0 release (https://github.com/tensorflow/federated/releases/tag/v0.52.0), which was released yesterday to PyPI (https://pypi.org/project/tensorflow-federated/0.52.0/).
Sorry, 0.51.0 is a typo. In fact, I was using 0.52.0 (which is why I emphasized that it was the version released today). Could you please re-investigate? I have just escaped from #3742 but am now trapped in a new one.
To clarify: you can execute code, but learning_process.initialize() is hanging indefinitely? Do you have any estimate of how long it has run?
Yes, that is the case. It ran for at least three to four hours before I lost patience. I have tried three times; each run hung in the same place with no error message, so I could not provide more information.
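For anyone else hitting a silent hang like this, a minimal sketch of a timeout wrapper that turns "runs for hours" into an explicit error after a chosen limit. The helper name run_with_timeout is hypothetical (not part of TFF) and it uses only the Python standard library. It assumes the hanging call can be run from a worker thread, and it does not stop the hung call; it only stops waiting for it.

```python
import threading


def run_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread; raise TimeoutError if it does not finish in time.

    Hypothetical helper for illustration only. The worker thread is not killed
    on timeout; being a daemon thread, it simply will not block interpreter exit.
    """
    result = {}

    def target():
        try:
            result["value"] = fn()
        except BaseException as exc:  # re-raised in the calling thread below
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        raise TimeoutError(f"call did not finish within {timeout_s} seconds")
    if "error" in result:
        raise result["error"]
    return result["value"]


# Assumed usage in the notebook (learning_process comes from the tutorial):
# state = run_with_timeout(learning_process.initialize, 300)
```

A TimeoutError after a few minutes at least gives a concrete data point to include in a report, rather than an open-ended wait.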
@SamuelGong I think that if you remove the call to tff.backends.native.set_sync_local_cpp_execution_context, it should run. This call should now be mostly unnecessary (as it is the default execution context), though it isn't clear why re-setting it causes the hang. Can you see if that changes things?
It works for me! Thank you very much.
Since I can now run the tutorial notebook on my local machine, I have access to the Jupyter notebook's log. Inspecting the log, I found that calling tff.backends.native.set_sync_local_cpp_execution_context() produces errors like ERROR: Illegal value '3383.0' specified for flag 'max_concurrent_computation_calls'. It seems that max_concurrent_computation_calls is expected to be an integer, while the code in the tutorial does not ensure this. I am reopening this issue in case you have not caught this bug yet.
I think that tff.backends.native.set_sync_local_cpp_execution_context shouldn't be invoked in the tutorial at all, since it's now the default. As for the illegal value, this might be due to using Jupyter. I don't think we have any idea whether it works with TFF or not (and would generally recommend Colab instead).
@SamuelGong I am stuck with the same hang when I execute state = learning_process.initialize(): there is no error message, but execution hangs.
Even after I removed tff.backends.native.set_sync_local_cpp_execution_context, it still did not work. I am using TFF 0.52.0 and TF 2.11.0 on my local system.
Can you please help? How can this be solved?
Hi. For me, it was previously solved by removing that line. However, as TFF is changing rapidly between versions, that may no longer work. If not, you may want to reach out to the team.
I'm facing the same issue when I use TensorFlow Federated in Google Colab. When I try to run tff.federated_computation(lambda: 'Hello, World!')(), this command also hangs. The same happens with the .initialize() function when I try to start training my model using tff.learning.algorithms.build_weighted_fed_avg. Has anyone faced this issue recently?
@niharikagupta2021 I would encourage you to open a separate GitHub issue for this. Please make sure to include the suggested details: things like version, operating system, etc. are critical to debugging this kind of thing.
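As a convenience for gathering those details, a minimal sketch that prints the Python version, operating system, and installed package versions using only the standard library. The function name collect_environment is hypothetical, and the default package names are the PyPI distribution names assumed to be relevant here.

```python
import platform
import sys
from importlib import metadata


def collect_environment(packages=("tensorflow-federated", "tensorflow")):
    """Collect basic version details to paste into a bug report.

    Hypothetical helper for illustration; packages are PyPI distribution names.
    """
    info = {"python": sys.version.split()[0], "os": platform.platform()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


for key, value in collect_environment().items():
    print(f"{key}: {value}")
```

Running this in the same Colab or Jupyter session that shows the hang keeps the reported versions consistent with the environment where the problem occurs.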