Stuck at learning_process.initialize() in DP Tutorial

Open SamuelGong opened this issue 1 year ago • 10 comments

Describe the bug: In the Colab notebook for DP, everything went well until I reached a code block where the program kept running for hours without printing any log output. Further debugging showed that the program never finished executing the line state = learning_process.initialize().

Environment: The experiment is conducted from scratch using today's TFF (0.51.0). No modification has been made to any part of the notebook.

Expected behavior: The mentioned line should finish executing in an acceptable time, on the order of minutes at most.

SamuelGong avatar Mar 21 '23 09:03 SamuelGong

Is this a duplicate of #3742?

Please try the new 0.52.0 release (https://github.com/tensorflow/federated/releases/tag/v0.52.0), which was released yesterday to PyPi (https://pypi.org/project/tensorflow-federated/0.52.0/).

ZacharyGarrett avatar Mar 21 '23 14:03 ZacharyGarrett

Sorry, 0.51.0 was a typo. In fact, I was using 0.52.0 (which is why I emphasized that it was the version released today). Could you please take another look? I have just escaped from #3742 but am now stuck on a new problem.

SamuelGong avatar Mar 21 '23 16:03 SamuelGong

To clarify - You can execute code, but the learning_process.initialize() is hanging indefinitely? Do you have any estimate of how long it has run?

zcharles8 avatar Mar 21 '23 17:03 zcharles8

Yes, that is the case. It ran for at least three to four hours before I lost patience. I have tried three times; each run hung in the same place with no error message, so I cannot provide more information.

SamuelGong avatar Mar 22 '23 03:03 SamuelGong
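
When a call hangs with no error message, as described above, one generic way to get some information is Python's standard faulthandler module, which can dump the Python stack of every thread if a call is still running after a timeout. The following is a minimal sketch, not something proposed in this thread: run_with_watchdog and the log file name are hypothetical, and in the notebook the callable would be something like learning_process.initialize. If the hang is inside native code, the dump will likely show only the Python line that entered it.

```python
import faulthandler
import os
import tempfile


def run_with_watchdog(fn, log_path, timeout_s=600.0):
    """Call fn(); if it is still running after timeout_s seconds,
    write the Python stack of every thread to log_path.

    Hypothetical helper for illustration. It does not interrupt fn;
    it only records where the interpreter is stuck.
    """
    with open(log_path, "w") as log_file:
        # Schedule a one-shot stack dump; exit=False keeps the process alive.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
        try:
            return fn()
        finally:
            # Cancel the pending dump once fn returns or raises.
            faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    log_path = os.path.join(tempfile.gettempdir(), "tff_hang_traceback.log")
    # Stand-in callable; replace with the call that appears to hang.
    result = run_with_watchdog(lambda: sum(range(10)), log_path, timeout_s=5.0)
    print(result, "traceback log:", log_path)
```

The log file stays empty when the call returns before the timeout, so a non-empty file is itself a sign that the call was still running.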

@SamuelGong I think that if you remove the call to tff.backends.native.set_sync_local_cpp_execution_context it should run. This call should now be mainly unnecessary (as it is the default execution context) though it isn't clear why re-setting causes the hang. Can you see if that changes things?

zcharles8 avatar Mar 22 '23 23:03 zcharles8

It works for me! Thank you very much.

SamuelGong avatar Mar 23 '23 04:03 SamuelGong

Since I can now run the tutorial notebook on my local machine, I have access to the Jupyter notebook's log. Inspecting the log, I found that calling tff.backends.native.set_sync_local_cpp_execution_context() produces errors such as ERROR: Illegal value '3383.0' specified for flag 'max_concurrent_computation_calls'. It seems that max_concurrent_computation_calls is expected to be an integer, while the code in the tutorial does not ensure this. I am reopening this issue in case you have not caught this bug yet.

SamuelGong avatar Mar 29 '23 16:03 SamuelGong
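
The error message above is consistent with a value produced by Python's true division, which always returns a float. The sketch below illustrates that and one way to guarantee an integer. It is an assumption-laden example: the thread does not show the tutorial's actual expression, so concurrency_limit, total_clients, and clients_per_thread are hypothetical names, and only the value 3383 comes from the log.

```python
import math


def concurrency_limit(total_clients: int, clients_per_thread: int) -> int:
    """Return an integer number of concurrent calls, rounding up.

    Hypothetical helper: shows how to avoid handing a float such as
    3383.0 to a parameter that expects an integer.
    """
    if clients_per_thread <= 0:
        raise ValueError("clients_per_thread must be positive")
    # "/" yields a float even for exact results; ceil converts it to an int.
    return max(1, math.ceil(total_clients / clients_per_thread))


if __name__ == "__main__":
    print(repr(3383 / 1))             # float produced by true division
    print(concurrency_limit(3383, 1)) # int, safe to pass as an integer flag
    print(concurrency_limit(3383, 5)) # rounds up when the division is inexact
```

Rounding up rather than using floor division keeps the limit at least 1 when there are fewer clients than clients_per_thread.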

I think that tff.backends.native.set_sync_local_cpp_execution_context shouldn't be invoked in the tutorial at all, since it's now the default. As for the illegal value, this might be due to using Jupyter - I don't think we have any idea about whether it works with TFF or not (and would generally recommend colab instead).

zcharles8 avatar Mar 29 '23 21:03 zcharles8

@SamuelGong I am stuck on the same hang when I execute state = learning_process.initialize(). There is no error message, but execution hangs.

Even after I removed tff.backends.native.set_sync_local_cpp_execution_context, it still did not work. I am using TFF 0.52.0 and TF 2.11.0 on my local system.

Can you please help? How can this be solved?

deepquantum88 avatar Apr 08 '23 19:04 deepquantum88

Hi. For me, it was previously solved by removing that line. However, as TFF is changing versions rapidly, the fix may no longer work. If it does not, you may need to ask the team.

SamuelGong avatar Apr 11 '23 02:04 SamuelGong

I'm facing the same issue when I use TensorFlow Federated in Google Colab. When I try to run tff.federated_computation(lambda: 'Hello, World!')(), this command also hangs. The same happens with the .initialize() function when I try to start training my model using tff.learning.algorithms.build_weighted_fed_avg. Has anyone faced this issue recently?

niharikagupta2021 avatar Mar 27 '24 00:03 niharikagupta2021

@niharikagupta2021 I would encourage you to open a separate GitHub issue for this. Please make sure to include the suggested details - things like version, operating system, etc. are critical to debugging this kind of thing.

zcharles8 avatar Mar 27 '24 01:03 zcharles8
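
The details requested above (versions, operating system) can be gathered with the Python standard library alone. This is a minimal sketch under stated assumptions: collect_environment is a hypothetical helper, and the package names are the PyPI distribution names tensorflow-federated and tensorflow.

```python
import platform
import sys
from importlib import metadata


def collect_environment(packages=("tensorflow-federated", "tensorflow")):
    """Return a dict of basic environment details for a bug report.

    Hypothetical helper: reads installed distribution versions without
    importing the packages, so it works even if importing them hangs.
    """
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Pasting this output into a new issue covers the version and operating system details without requiring the reporter to remember which release is installed.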