
Reproducibility issue with transformers (BERT) and tf2.2

Open MFreidank opened this issue 4 years ago • 21 comments

Dear @duncanriach, Thank you for your contributions, work, and guidance towards making TensorFlow deterministic in the recent releases. Unfortunately, for popular Keras NLP models such as BERT, some problems seem to remain (see also the related issue in this repository, #14).

In spite of combining learnings from:

... I still arrive at the following short, non-deterministic Colab notebook example.

My results for the sum of model weights (as computed with a function you had suggested) after training for only 5 steps are as follows (differences are highlighted in bold):

| Device | Before training | After training |
| --- | --- | --- |
| Run 1 (GPU) | -641227.5609667897224 | -641237.442**5159916282** |
| Run 2 (GPU) | -641227.5609667897224 | -641237.442**3093758523** |
| Run 1 (CPU) | -641227.5609667301178 | -641238.1506845243275 |
| Run 2 (CPU) | -641227.5609667301178 | -641238.1506845243275 |

This variance becomes more pronounced the longer the model is trained.

Could you please help identify the source of non-determinism and provide guidance on how we can resolve it?

As transformers is a very popular package (29.1k GitHub stars), I expect that many other people are silently affected by this phenomenon.

Note: As shown above, I have observed that the same code becomes fully deterministic when running on the Colab CPU runtime.
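
For reference, the weight-summary function referred to above (defined in the notebook as summarize_keras_weights) does something along these lines; this is a sketch, not the exact helper:

```python
import numpy as np

def summarize_keras_weights(model):
    # Collapse every trainable variable into a single scalar so that two
    # runs can be compared digit-for-digit.
    total = np.sum([np.sum(v.numpy()) for v in model.trainable_variables])
    print("Summary of weights: %.13f" % total)
    return total
```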

MFreidank avatar Jun 16 '20 16:06 MFreidank

Beautifully presented. Thanks, @MFreidank. I made a copy of your Colab code and have been looking at it. The primary issue right now is that the trainable variables are not matching between runs:

```
### Before training: ###
Summary of weights: -641227.5609667897224
### Before training: ###
Summary of weights: -641227.7293046712875
```

I can see that you have them matching, and I don't understand why that would be different for me. Have you changed the Colab code in some way since you ran it?

The second issue I see is that you're setting from_logits=True in the constructor of tf.keras.losses.SparseCategoricalCrossentropy. As your notes suggest, this argument should be excluded (or set to False).

duncanriach avatar Jun 16 '20 22:06 duncanriach

Oh, I see. You have to restart the runtime to get the same initial trainable variables. I can hopefully provide a work-around for that too.

duncanriach avatar Jun 16 '20 22:06 duncanriach

So, the solution for getting the same initial trainable variables every time you run the block of code that starts with the definition of summarize_keras_weights is to call tf.random.set_seed at the beginning of that block. This will reset the pseudorandom number generator that is used to initialize the trainable variables of the model.
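
A minimal sketch of that (the model class and seed value here are illustrative, not taken from the notebook):

```python
import tensorflow as tf
from transformers import TFBertForSequenceClassification

# Reset TensorFlow's global PRNG so that every randomly initialized weight
# (e.g. the classification head added on top of the pretrained encoder)
# comes out identical on each execution of this block, without needing to
# restart the runtime.
tf.random.set_seed(42)

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
```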

duncanriach avatar Jun 16 '20 22:06 duncanriach

And ... solved. By removing from_logits=True from the constructor of tf.keras.losses.SparseCategoricalCrossentropy() I was able to get the same trainable variables after both runs.
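
In code terms, the change is simply this (sketch); the matching output from the two runs follows:

```python
import tensorflow as tf

# Before (non-deterministic on GPU in this setup; from_logits=True routes
# through the fused softmax/cross-entropy op):
# loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# After (from_logits left at its default of False, as the notes require):
loss = tf.keras.losses.SparseCategoricalCrossentropy()
```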

```
### Before training: ###
Summary of weights: -641227.5609667897224
5/5 [==============================] - 7s 1s/step - loss: 0.7225 - accuracy: 0.4000
### After training: ###
Summary of weights: -641238.1517347339541
### Before training: ###
Summary of weights: -641227.5609667897224
5/5 [==============================] - 7s 1s/step - loss: 0.7225 - accuracy: 0.4000
### After training: ###
Summary of weights: -641238.1517347339541
```

You were so close. If only you had coded exactly what your notes required. :-)

duncanriach avatar Jun 16 '20 22:06 duncanriach

Please confirm that your issue has been solved. Train your model for much longer, at least for one whole epoch, and confirm that it's getting the accuracy you expect while also achieving perfect, bit-exact reproducibility.

duncanriach avatar Jun 16 '20 22:06 duncanriach

@duncanriach Thank you! I can reproduce the resolution and things are now deterministic in the scenario above. I should have taken my own advice from the notes, based on your workaround in the TensorFlow issue thread. ;)

There is one issue remaining, though: changing epochs=1 to epochs=2 reintroduces non-determinism (even when keeping steps_per_epoch at only 5). Note that training for the same 10 steps using epochs=1, steps_per_epoch=10 is deterministic; the two configurations are compared in the sketch below.
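
To make the comparison concrete, here is a toy stand-in (a small dense model instead of BERT, purely so the snippet runs on its own); in the actual notebook the first configuration was reproducible and the second was not:

```python
import tensorflow as tf

tf.random.set_seed(123)

# Toy data and model standing in for the notebook's tokenized dataset and BERT model.
x = tf.random.normal([80, 4])
y = tf.cast(tf.reduce_sum(x, axis=-1) > 0, tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(8).repeat()

model = tf.keras.Sequential([tf.keras.layers.Dense(2, activation="softmax")])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

# Configuration that trained reproducibly in the notebook: 10 steps in one epoch.
model.fit(dataset, epochs=1, steps_per_epoch=10)

# Configuration that reintroduced non-determinism there: the same 10 steps
# split across two epochs.
model.fit(dataset, epochs=2, steps_per_epoch=5)
```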

Could you have a look at this? I updated my Colab notebook to reflect the current state and expose the issue mentioned above.

Almost looks like Keras is doing some non-deterministic operations in between epochs. For my purposes, I may be able to simply stretch the epoch I am training for artificially (covering multiple passes over the dataset) and get things running deterministically that way; I'll investigate this. Nevertheless, I believe this warrants further investigation, and I'm happy to help in any way I can.

Update: For epochs=2, steps_per_epoch=10, I found it to be reproducible on the CPU. So the issue must stem from something GPU-related.

MFreidank avatar Jun 17 '20 10:06 MFreidank

Could you have a look at this?

Will do.

Almost looks like Keras is doing some non-deterministic operations in between epochs.

These between-epoch issues are common and there are several possible sources. Let's see if we can get determinism without you needing to limit the training to one epoch ...

duncanriach avatar Jun 17 '20 23:06 duncanriach

Running in Colab, with my old copy of your code (with the fixes), I'm no longer seeing reproducibility on 5 steps in one epoch on GPU. This is very concerning and I have not yet figured out what the issue is. Also, looking at your updated Colab code and notes, it seems that one epoch with 10 steps on the GPU is not operating reproducibly, which does not match what you wrote above.

duncanriach avatar Jun 18 '20 02:06 duncanriach

Just to recap where we're at and the solutions we have:

  1. Using tf.random.set_seed, reset TensorFlow's PRNG before initializing trainable variables.
  2. Set TF_DETERMINISTIC_OPS=1 to enable all deterministic ops in the model.
  3. Replace the non-deterministic fused softmax/cross-entropy with a deterministic version. You and I are also working on adding a fix for this, which will likewise be under the control of TF_DETERMINISTIC_OPS. (All three adjustments are sketched together after this list.)
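
Pulled together, the first two adjustments (plus the from_logits change already discussed, standing in for item 3 until the patched op lands) look roughly like this at the top of the notebook; the seed value is illustrative:

```python
import os
import tensorflow as tf

# (2) Opt in to the deterministic op implementations available in this
#     TensorFlow version. Must be set before TensorFlow touches the GPU.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

# (1) Reset the PRNG so trainable-variable initialization repeats exactly.
tf.random.set_seed(123)

# (3) Avoid the fused softmax/cross-entropy path by leaving from_logits at
#     its default (False), i.e. computing the loss from probabilities.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
```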

With these three adjustments, there is still some non-determinism. However, rather than just being totally different on every run, the final state of the trainable variables is now one of a discrete set of values. The number of possible values seems to increase with the number of steps per epoch.

With steps_per_epoch=1 and steps_per_epoch=2, I got the same final value after several runs.

With steps_per_epoch=5, over seven runs, I got only four different results. One result was repeated three times and another was repeated twice.

What this suggests to me is that there may be some non-determinism in the interaction between the data-loader (based on tf.data) and model.fit. I've seen things like this before, but not exactly like this, and nothing jumps out at me from your code that could be causing it (such as multiple data-loader workers or an unseeded shuffle).
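
For reference, the kinds of data-pipeline settings that usually matter here look like this (a sketch with illustrative values; I have not confirmed that any of these is the culprit in your notebook):

```python
import tensorflow as tf

# Toy features/labels standing in for the tokenized examples.
features = tf.random.normal([100, 8])
labels = tf.zeros([100], dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Seed the shuffle so the example order is identical across runs.
dataset = dataset.shuffle(buffer_size=100, seed=123, reshuffle_each_iteration=True)

# Ask tf.data not to reorder elements for throughput when processing in parallel.
options = tf.data.Options()
options.experimental_deterministic = True
dataset = dataset.with_options(options)

dataset = dataset.batch(8)

# In model.fit, also avoid multi-process generator workers:
# model.fit(dataset, epochs=1, workers=1, use_multiprocessing=False)
```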

I'll investigate more tomorrow.

duncanriach avatar Jun 18 '20 05:06 duncanriach

@duncanriach Thank you very much for your work and drive on this, and for the conclusive summary of where we stand and what we know. I agree with all your points, but I was not able to pinpoint the exact source of the problem (I tried setting workers=0 to make it run on the same thread as the main training loop, but to no avail). Looking forward to your further investigation, and I'm happy to help from my side in any way I can.

MFreidank avatar Jun 18 '20 15:06 MFreidank

I've been trying different batch sizes and numbers of steps. There seems to be a non-determinism effect that kicks in with larger batch sizes, and a seemingly independent effect related to the number of steps (and/or perhaps the number of examples trained). This is reminding me of the unresolved aspects of issue 9 (for OpenNMT).

I have not yet gotten the non-determinism debug tool working with this model, which will enable me to dig in more deeply to isolate the remaining source, or sources, of non-determinism. I'm also learning more about BERT and about transformers in general.

I presume that each step of training this model runs the complete sentence example (or batch of sentence examples) through the encoder and then the decoder, then calculates the loss, and then back-propagates the gradients to the trainable variables. If we see non-determinism in the trainable variables appear on any given step, it will have been caused by that example (or the examples in that batch) interacting with the trainable variables, as they have been trained during the previous steps, via a non-deterministic op or process.

Since this is a large and relatively complex sequence model, there is extensive iterative processing happening, unlike with a simple feed-forward DNN, so there may be additional opportunities for non-determinism to be injected. There may also be the use of sparse operations (for things like sparse embeddings), some of which have been suspect for a while (but have not yet been fully investigated).

I intend to keep investigating this issue.

BTW, in a comment in the code, you mention that the data loading and preparation is deterministic. Did you confirm that? If so, how?

duncanriach avatar Jun 20 '20 05:06 duncanriach

@MFreidank, we (@wenscarl and I) have isolated the remaining source of nondeterminism in this model. See this comment on TensorFlow Issue 39751 for more information about the source.

We have also confirmed that this was the only remaining source of nondeterminism in the model by temporarily replacing the use of tf.gather in the huggingface/transformers BERT code with a much slower tf.linalg.matmul operation, the dense backprop output of which can be used directly to update the word embedding matrix (without the need for the currently-nondeterministic tf.convert_to_tensor). The model trained reproducibly for thousands of batches, over multiple epochs.
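
For anyone following along, the mechanism can be seen in miniature with a sketch like this (shapes and loss are arbitrary):

```python
import tensorflow as tf

embedding = tf.Variable(tf.random.normal([30522, 768]))  # word-embedding table
token_ids = tf.constant([[101, 2023, 2003, 102]])

with tf.GradientTape() as tape:
    embedded = tf.gather(embedding, token_ids)  # the embedding lookup
    loss = tf.reduce_sum(tf.square(embedded))

grad = tape.gradient(loss, embedding)
# The gradient of tf.gather arrives as a tf.IndexedSlices. Densifying it via
# tf.convert_to_tensor (to update the full embedding matrix) goes through a
# segment-sum style reduction, which was the remaining non-deterministic GPU
# op identified above.
dense_grad = tf.convert_to_tensor(grad)
print(type(grad).__name__, dense_grad.shape)
```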

We are close to releasing a patch for the TensorFlow segment sum ops which, when applied via fwd9m.tensorflow.enable_determinism, will remove this final source of nondeterminism.

duncanriach avatar Oct 23 '20 22:10 duncanriach

Update: @wenscarl has confirmed that the patch we are about to release (to be enabled via fwd9m.tensorflow.enable_determinism) will resolve the final source of nondeterminism in this model, causing it to train deterministically.

duncanriach avatar Nov 05 '20 01:11 duncanriach

@duncanriach Thank you very much for your contributions towards making TensorFlow deterministic. I am using huggingface/transformers BERT with TF 2.2, and I was wondering when the patch will be released.

Zminghua avatar Dec 23 '20 14:12 Zminghua

Hi @Zminghua, I don't have an estimated release date for the patch, but it's relatively high priority for us. The patch will work with TensorFlow version 2.3 and earlier. A recently-discovered problem, which we're attempting to find a solution for, is that from TensorFlow version 2.4 onwards the TensorFlow API no longer exposes the mechanisms that allow for a dynamic patch to be applied from outside the distributed package. This means that we'll have to focus on getting these solutions into upstream stock TensorFlow rather than relying on the theoretically quick triage route provided by patching.

duncanriach avatar Jan 05 '21 03:01 duncanriach

Hi @duncanriach,

By putting the `fwd9m` sub-directory in my project directory and then importing it as follows:

```python
from fwd9m.tensorflow import enable_determinism
enable_determinism()
```

my code has become fully deterministic when running on GPU.

Really, thank you again!

Zminghua avatar Jan 06 '21 03:01 Zminghua

Oh, you're welcome. Right, you can just clone the code and use it, of course, rather than waiting for the PyPI release.

duncanriach avatar Jan 07 '21 21:01 duncanriach

Update: we have confirmed that fwd9m.tensorflow.enable_determinism, which currently includes patching of segment_sum and unsorted_segment_sum, will in fact work on TensorFlow 2.4.0. I don't understand why this is; it's not what I expected, given what was in the version 2.4.0 release notes and the associated changes in the stock TensorFlow source code.

duncanriach avatar Jan 12 '21 04:01 duncanriach

I cloned the repository and followed the instructions above, i.e. `from framework_determinism.fwd9m.tensorflow import enable_determinism`. However, I get this error message:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/shared_ad2_mt1/thopham/projects/exp-1/oda-cognitive/services/train-pool/train-models/framework_determinism/fwd9m/tensorflow/enable_determinism.py", line 61, in _enable_determinism
    patch_bias_add(_silent=True)
TypeError: 'module' object is not callable
```

I am using Python 3.6.

Much appreciated.

phqtuyen avatar Jun 09 '21 19:06 phqtuyen

Hi @phqtuyen,

Please pull the master branch and try again.

This was a bug that only showed up with stock TensorFlow versions 1.14 through 2.0. It was fixed in the incomplete and un-merged integration-testing branch. This demonstrates the hazards involved in using unreleased (and non-regression-tested) code.

Let me know how it goes.

duncanriach avatar Jun 11 '21 00:06 duncanriach

This should be fixed in TF 2.7 by PR 51861. Would someone please confirm, so that this issue can be closed?

duncanriach avatar Sep 17 '21 01:09 duncanriach