framework-reproducibility
framework-reproducibility copied to clipboard
[WIP] Feature/patch/softmax cross entropy with logits
Addressing previously reported tensorflow issue #38185. Also tied to tfdeterminism
issues #19, #14, #9.
Fused implementation of {sparse_,}softmax_cross_entropy_with_logits
exhibits non-determinism in its backprop.
This PR will provide a patch that maintains the same function interface but computes the computational steps involved in a non-fused way that is fully deterministic.
Comments above. Heads-up, before you're done, please remember to implement test/test_fused_softmax_cross_entropy.py
in the spirit of test/test_patch_bias_add.py
. This will use the appropriate forward code from the upstream TensorFlow repo along with the deterministic gradient test code that I provided in TensorFlow Issue 38185 (also testing all the modified op bindings).
I'm going to work on adding enable_determinism
, from which _patch_fused_softmax_cross_entropy
can be called.
I just added enable_determinism
with this commit. You should use enable_determinism
to apply the patch in test/test_patch_fused_softmax_cross_entropy.py
.
I'm going reorganize the testing quite a lot before release. For the testing part, I would like you to focus on getting the following complete and working:
$ cd test
$ echo "Test TF1 API:"
$ ./container.sh tensorflow/tensorflow:1.14.0-gpu python test_patch_fused_softmax_cross_entropy.py
$ echo "Test TF2 API:"
$ ./container.sh tensorflow/tensorflow:2.2.0-gpu python test_patch_fused_softmax_cross_entropy.py
Hi, my name is Thomas, I am working at the Oracle Digital Assistant team in which we made extensively use of Tensorflow and Tensorflow-determinism, we really appreciate this wonderful library. I just wish to ask is this issue going to be merged anytime soon?
Hi Thomas (@phqtuyen),
Thanks for letting us know. These ops seem to be generally high priority for determinism in TensorFlow.
@MFreidank: please can you give an update on this and/or move it forward. Most of the rest of the infrastructure is now in place for this.
Hi @phqtuyen, @duncanriach Sorry for the wait, I had been pulled into other things. I've picked up work on this now and will have an update to the PR before end of this week. I'll base my changes on the new infrastructure recently put in place.
Awesome. Just FYI, I'll be offline next week (2020-08-17 through 2020-08-21) and, since I'll need to be involved in reviews and iterations, that might slow things down a bit.
Hey @MFreidank, are you still planning on completing this PR?
Hi @MFreidank, I'm about to go offline again until 2020-09-16, but I wanted to check-in with you again. I think that this op's nondeterminism is near the top of the priority/urgency list for TensorFlow, and I want to get it resolved ASAP. Can you commit to pushing this through from your end?
Hi Duncan,
Sorry for the delay on this. Yes, I'll pick it up starting tomorrow my time.
Duncan Riach [email protected] schrieb am Fr., 4. Sep. 2020, 20:07:
Hi @Freidank, I'm about to go offline again until 2020-09-16, but I wanted to check-in with you again. I think that this op's nondeterminism is near the top of the priority/urgency list for TensorFlow, and I want to get it resolved ASAP. Can you commit to pushing this through from your end?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/NVIDIA/framework-determinism/pull/21#issuecomment-687301639, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQSWKBQYI2MLT4QV42MP6DSEEUFJANCNFSM4OEKR34Q .
Thank you, Moritz.
Hi again @duncanriach,
I've implemented a first version (on top of the most recent master
branch with enable_determinism
) that patches only tf.nn.sparse_softmax_cross_entropy_with_logits
and tf.nn.softmax_cross_entropy_with_logits
and integrated the unit test case you shared into test/test_softmax_cross_entropy_with_logits.py
.
To test that this works as intended I'll need access to my GPU which will only be available to me in another 6 hours.
If the test cases pass, I'll update the PR with the above.
What remains to be determined is which other functions need to be patched after that (by scanning through tensorflow/python
as outlined by you above).
I will first focus on getting the minimal patch for the functions in tf.nn
to work including unit tests and will then broaden it to include the other functions.
Thanks, @MFreidank. Sounds good. Just to reiterate, I'll be offline until 2020-09-16 and I'll review this after that.
Hey @MFreidank, please will you submit those changes so that I can review them?
@MFreidank, we will soon release version 0.4.0. We intend to include the fused softmax/cross-entropy patch in version 0.5.0 and I want to get started on it ASAP. If I don't hear from you in the 24 hours, we will proceed to implement it without you.