framework-reproducibility

Lack of reproducibility when using Huggingface transformers library (TensorFlow version)

Open · dmitriydligach opened this issue on Apr 28, 2020 · 11 comments

Dear developers,

I included in my code all the steps listed in this repository, but I still could not achieve reproducibility using either TF 2.1 or TF 2.0. Here's a link to my code:

https://github.com/dmitriydligach/Thyme/blob/master/Keras/et.py
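
For reference, the determinism-related setup looks roughly like the following (an illustrative, trimmed sketch of the steps described in this repository, not an exact copy of the script above):

```python
import os
import random
import numpy as np

# Set before TensorFlow is imported, so the flags are picked up at initialization.
os.environ['TF_DETERMINISTIC_OPS'] = '1'      # TF 2.1
# os.environ['TF_CUDNN_DETERMINISTIC'] = '1'  # TF 2.0 equivalent for cuDNN

import tensorflow as tf

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```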

Please help.

dmitriydligach avatar Apr 28 '20 18:04 dmitriydligach

@dmitriydligach Did you ever get this resolved?

MFreidank avatar Jun 15 '20 12:06 MFreidank

@MFreidank Nope. I switched to PyTorch, which has a more reliable way to enforce determinism.
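
Roughly, the settings I mean are along these lines (an illustrative sketch, not lifted from my actual script):

```python
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)  # also seeds the CUDA RNGs

# Ask cuDNN to select deterministic algorithms
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```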

dmitriydligach avatar Jun 16 '20 16:06 dmitriydligach

@dmitriydligach Just to verify: your code becomes fully reproducible with PyTorch?

MFreidank avatar Jun 16 '20 17:06 MFreidank

PyTorch has potentially different non-deterministic ops than TensorFlow, and, as yet, no general mechanism to enable deterministic op functionality. Both PyTorch and TensorFlow now have the ability to enable deterministic cuDNN functionality.

This code may use an op that happens to be non-deterministic in TensorFlow but deterministic in PyTorch.

I'm hoping to look at this code in detail soon, hopefully today.

duncanriach avatar Jun 16 '20 17:06 duncanriach

@MFreidank In most cases, I get exactly the same results every time I run my PyTorch code (including loss and accuracy for each epoch). In some (relatively infrequent) cases there's still a difference, but it's not nearly as large as with TensorFlow.

dmitriydligach avatar Jun 16 '20 17:06 dmitriydligach

@duncanriach Thanks for your blazingly fast response! :) I'm still interested in resolving this issue in TF 2.2 and would highly appreciate it if you could help investigate.

A helpful starting point could be my Colab example.

@dmitriydligach Thanks for those additional details; it sounds like there is still slight non-determinism in PyTorch as well, though it may not affect loss/accuracy as strongly. This is valuable information for me, thank you for sharing your experience :)

MFreidank avatar Jun 16 '20 17:06 MFreidank

@dmitriydligach: I'm sorry I didn't get this sorted out in time for you to benefit from determinism in TensorFlow.

@MFreidank: I'll prioritize taking a look at these issues. They could have the same underlying cause, or there could be different sources. Often in these kinds of problems there is an issue with the setup that is easy to resolve; I intend to add better step-by-step instructions to the README for that. Sometimes a known (and not-yet-fixed) non-deterministic op is being used, and sometimes there is a new discovery: an op that is non-deterministic that we didn't know about. We'll figure this out.
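
As a first check, something along these lines usually shows quickly whether two runs are bit-identical (an illustrative sketch; `build_and_train` is a placeholder for whatever your script does to set seeds, build the model, and train it):

```python
import numpy as np

def runs_are_identical(build_and_train, seed=42):
    """Train twice with the same seed and compare final weights bit-for-bit."""
    weights_a = build_and_train(seed).get_weights()
    weights_b = build_and_train(seed).get_weights()
    return all(np.array_equal(a, b) for a, b in zip(weights_a, weights_b))
```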

duncanriach avatar Jun 16 '20 17:06 duncanriach

@duncanriach Thanks a lot for taking the time to look into this and for your encouragement. I feel much more confident about this now, knowing that someone with your experience will be having a look.

MFreidank avatar Jun 16 '20 18:06 MFreidank

Hey @dmitriydligach, it looks like we have reproducibility on issue 19 (Huggingface Transformers BERT for TensorFlow); @MFreidank is confirming. Looking at your code, I don't see any reason for there to be non-determinism. I want to repro what you're seeing so that I can debug it. I have it running, but it looks like I have to specify DATA_ROOT and provide data there. Can you give me instructions to repro with the data you're using?

duncanriach avatar Jun 17 '20 03:06 duncanriach

@duncanriach The non-reproducibility of @dmitriydligach's code may be related to his training for multiple epochs; see my update on issue #19.

MFreidank avatar Jun 17 '20 10:06 MFreidank

@duncanriach Thank you very much for looking into this issue.

Unfortunately, I'm not able to provide the data (this is medical data that can only be distributed via a data use agreement). However, perhaps it would help you to know that the data consists of relatively short text fragments (max_len ~ 150 word pieces)...
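
If it helps, randomly generated inputs of roughly the same shape could stand in for the real data, along these lines (purely synthetic and illustrative; the vocabulary size and class count are placeholders, not the real values):

```python
import numpy as np

def make_fake_data(num_examples=1000, max_len=150, vocab_size=30522,
                   num_classes=2, seed=0):
    """Random token-id sequences shaped like the real data:
    short fragments of ~150 word pieces."""
    rng = np.random.RandomState(seed)
    input_ids = rng.randint(1, vocab_size, size=(num_examples, max_len))
    attention_mask = np.ones_like(input_ids)
    labels = rng.randint(0, num_classes, size=(num_examples,))
    return input_ids, attention_mask, labels
```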

dmitriydligach avatar Jun 17 '20 20:06 dmitriydligach