Port Gemma to TF
What does this PR do?
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@Rocketknight1 @ArthurZucker - I want to port this model to TF. I found Keras model weights on Kaggle: https://www.kaggle.com/models/google/gemma but I can't find the modeling code for it anywhere. If this has already been ported to TF, please let me know.
@Rocketknight1 - can you please create a GPT-4 draft like you did for me in https://github.com/huggingface/transformers/pull/26870? Thank you!
Would be nice to start fresh with a Keras 3 implementation for a new template, wdyt @Rocketknight1?
@ArthurZucker we still don't have the Keras 3 backend in place - I have a partial PR, but I'm not sure the team is ready for a fourth framework in the library yet!
For now, doing this as a TF / Keras 2 port seems like the best idea. There is already a Keras 3 port of it for Keras-NLP, but we'll likely have to port some bits of this to make it work properly for Keras 2: https://keras.io/api/keras_nlp/models/gemma/
@a8nova do you have a preference between starting from the Keras-NLP port or an automatically translated port of the PyTorch code?
Thank you @ArthurZucker & @Rocketknight1. I think an auto-translated port of the PyTorch code will be better.
By the way, is it possible to share that prompt or script for the auto-translation? I am noting down all the little things GPT-4 misses during the translation; I wonder if we could keep improving or adding to the prompt so that the auto-translation works in one shot :)
I also want to port the TTS model VITS to TF, so I will have to bother you again for auto-translating the VITS model.
Hi @a8nova, here's a port of the Gemma modeling code! Let me know if you need anything else.
I did this port with Claude 3, and if you want to try it yourself, here's the prompt I used:
You are a translation bot designed to translate code in the Hugging Face Transformers library from PyTorch to TensorFlow / Keras.
You will be passed a PyTorch file from the library. Your goal is to output the equivalent
TensorFlow code. If you want, you can think carefully before you start and write any thoughts or issues you have
as comments at the top of the output file. You can also add comments to the code to indicate any issues or
areas of uncertainty. Please preface your comments with "# Claude: " so that we can easily find them.
There are some guidelines you should follow when translating the code:
- When creating layers or other network modules in the `__init__()` method please pass their attribute name as the name kwarg.
- If a class inherits from `PreTrainedModel` it should instead inherit from `TFPreTrainedModel`.
- Retain any docstrings attached to methods like `forward()` and translate them, even when the method is being renamed to `call()`.
- Layer and model classes should accept **kwargs and pass these to `super().__init__()`. They should also be renamed by adding "TF" to the start of their name.
- If the class calls other classes in the same module, please add "TF" to the start of their name if required.
- TensorFlow layers do not require input shape arguments in the same way as PyTorch layers. As a result, the first
  argument to the constructor of layers like `Dense` or `Conv2D` (but not `Embedding`) can usually be removed.
- TensorFlow `Embedding` layers do not have a `padding_idx` argument. Please remove this argument from the constructor.
- Prefer the function `shape_list()`, which returns a list, over methods like `tensor.shape` or `tf.shape(tensor)`. You can get this with `from tf_utils import shape_list`.
- Keras layers do not have a `register_buffer()` method. Instead, replace `self.register_buffer(persistent=True)` with `self.add_weight(trainable=False)`. For `self.register_buffer(persistent=False)`, the best solution is usually to just compute the value in `call()`. If it's totally constant and never changed in the `forward()` method, you can store it as a `tf.constant` in the `__init__()` method instead. 
- Output classes like `BaseModelOutput` or `SequenceClassifierOutput` should have "TF" added to the start of their name.
- PyTorch Conv2D layers are always "channels_first", but TensorFlow convolutions are "channels_last". This will require inputs to be transposed.
- NumPy operations and calls to `.numpy()` must be avoided! Use TensorFlow operations instead.
- Raw layer weights that are not contained in sublayers should be created in the layer `build()` method, not in `__init__()` or `call()`. These are usually created with `self.add_weight()`, passing the PyTorch attribute name as the name kwarg. Other sublayers and modules that are not created with `self.add_weight()`, however, should be created in the `__init__()` method like they are in PyTorch.
- Make sure to retain things like docstrings from the PyTorch code. You can try to translate those too, but if you're unsure, just leave them untranslated and mark with a comment that they should be translated later.
- FlashAttention is not supported in Keras. You can remove code paths that use it.
- If you want an exact replacement for torch's `scaled_dot_product_attention()` function, you can get one via `from tf_utils import scaled_dot_product_attention`.
Claude 3's port looks a lot cleaner than GPT-4's was for IDEFICS. The main things I see are:
- `scaled_dot_product_attention` isn't actually in the library until we merge your IDEFICS PR! You might have to copy that in manually for now
- `init_weights()` and `post_init()` calls are still there, we don't want them for TF.
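To make the guidelines above a bit more concrete, here is a minimal, hand-written sketch (not the actual ported code; the layer below is a simplified, hypothetical Gemma-style MLP) showing how a few of the rules play out in practice:

```python
import tensorflow as tf
from tensorflow import keras

# PyTorch original (simplified, for illustration only):
#
#   class GemmaMLP(nn.Module):
#       def __init__(self, config):
#           super().__init__()
#           self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
#           self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
#           self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
#
#       def forward(self, x):
#           return self.down_proj(nn.functional.gelu(self.gate_proj(x)) * self.up_proj(x))


class TFGemmaMLP(keras.layers.Layer):
    # "TF" is prepended to the class name, **kwargs are forwarded to
    # super().__init__(), and each sublayer gets its attribute name as the
    # `name` kwarg, exactly as the prompt asks.
    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        # Keras Dense layers infer the input dimension at build time, so the
        # first PyTorch constructor argument (hidden_size) is dropped.
        self.gate_proj = keras.layers.Dense(config.intermediate_size, use_bias=False, name="gate_proj")
        self.up_proj = keras.layers.Dense(config.intermediate_size, use_bias=False, name="up_proj")
        self.down_proj = keras.layers.Dense(config.hidden_size, use_bias=False, name="down_proj")

    def call(self, x):
        # forward() becomes call(); the body maps one-to-one onto TF ops.
        return self.down_proj(keras.activations.gelu(self.gate_proj(x)) * self.up_proj(x))
```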
Thank you @Rocketknight1!
I just realized that I never told it to add things like @keras_serializable or @unpack_inputs but it knows the structure of transformers well enough that it did it anyway, lmao. cc @gante - your methods are in its brain!
I feel like this should definitely be documented somewhere! The prompt is so thorough, love it.
Hi all! I am running into compatibility issues when trying to run the tests for Gemma locally. There were too many issues on my Mac, so I am now trying to run this on a Google Colab, but I still ran into issues. Sharing the error here to see if anyone has seen it before, thanks!
This is when running the PyTorch version:
============================================== ERRORS ==============================================
____________________ ERROR collecting tests/models/gemma/test_modeling_gemma.py ____________________
ImportError while importing test module '/content/transformers/tests/models/gemma/test_modeling_gemma.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/local/lib/python3.10/dist-packages/_pytest/python.py:617: in _importtestmodule
    mod = import_path(self.path, mode=importmode, root=self.config.rootpath)
/usr/local/lib/python3.10/dist-packages/_pytest/pathlib.py:567: in import_path
    importlib.import_module(module_name)
/usr/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1050: in _gcd_import
    ???
<frozen importlib._bootstrap>:1027: in _find_and_load
    ???
<frozen importlib._bootstrap>:1006: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:688: in _load_unlocked
    ???
/usr/local/lib/python3.10/dist-packages/_pytest/assertion/rewrite.py:186: in exec_module
    exec(co, module.__dict__)
tests/models/gemma/test_modeling_gemma.py:23: in <module>
    from transformers.testing_utils import (
E   ImportError: cannot import name 'require_read_token' from 'transformers.testing_utils' (/usr/local/lib/python3.10/dist-packages/transformers/testing_utils.py)
Hi @a8nova, my guess is there's some kind of version mismatch between the transformers that's installed and the transformers repo you're running the tests in. Try `pip install -e .` in the repo directory to make sure your versions are synced up?
Ugh, thank you @Rocketknight1, not sure how I missed that. I am able to run the tests on the Colab now. Still issues on my Mac since `pip install -e ".[dev]"` is failing, but at least I am unblocked on Colab so I can continue working there.
Sharing the error from `pip install -e ".[dev]"` on an Intel Mac; I tried a few things from the web for the error below but am still seeing it:
ERROR: Could not find a version that satisfies the requirement decord==0.6.0; extra == "dev" (from transformers[dev]) (from versions: none)
ERROR: No matching distribution found for decord==0.6.0; extra == "dev"
In general, I find `pip install transformers[dev]` isn't really necessary! `pip install transformers[quality]` should be sufficient for most of what you need for a PR.
Hi @Rocketknight1! A few things:
- Caching isn't implemented yet, so I am skipping the tests that cover it; please look at this commit
- Since caching isn't implemented yet, some tests also fail because past_key_values is None. What should I do with these tests: skip or override them?
- I am not planning on implementing caching for the TF port, is that OK?
Thanks!
Hi @a8nova, when you say you're not implementing caching, does that mean past_key_values just isn't implemented at all, or we're not implementing the PyTorch StaticCache?
Not implementing StaticCache is totally okay! But we should definitely be able to return and accept some kind of past_key_values.
Hi @Rocketknight1 - I meant I am not planning on implementing the StaticCache, so past_key_values will always be None. Yes, we are definitely able to return and accept some kind of past_key_values.
Yeah - rather than implementing StaticCache, maybe we can just return tensors with variable shapes, like the other TF models do? You can probably copy the relevant code from another TF causal LM implementation.
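For reference, the usual pattern in the existing TF causal LMs is to concatenate the new keys/values onto the cached ones along the sequence axis and return the result, so the cache is just a tuple of variable-shape tensors rather than a StaticCache. A rough, illustrative sketch (names and shapes are assumptions, not the actual Gemma port):

```python
import tensorflow as tf


def apply_kv_cache(key_states, value_states, past_key_value=None, use_cache=True):
    """Illustrative helper: grow the key/value cache along the sequence axis,
    the way existing TF causal LMs handle past_key_values without a StaticCache."""
    if past_key_value is not None:
        past_key, past_value = past_key_value
        # Shapes are assumed to be (batch, num_heads, seq_len, head_dim);
        # the cache grows along the seq_len axis at each generation step.
        key_states = tf.concat([past_key, key_states], axis=2)
        value_states = tf.concat([past_value, value_states], axis=2)
    present_key_value = (key_states, value_states) if use_cache else None
    return key_states, value_states, present_key_value
```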
Hi @Rocketknight1 - Unrelated to this, but why is it that I can't find the OpenELM source code in the GitHub repo, while I see it on the Hub: https://huggingface.co/apple/OpenELM-270M/blob/main/modeling_openelm.py?
@a8nova that happens a lot - it means it's a custom code model. Those models include their modelling source in the repo itself with the weights, which means they don't have to wait for support in Transformers to share their model. You can load them with `AutoModel.from_pretrained("path_to_model", trust_remote_code=True)`.
A lot of models start out as custom code models, and then get ported into the actual Transformers repo when they gain significant usage!
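For completeness, loading a custom code model like that looks roughly like this (OpenELM is a causal LM and ships its own modeling_openelm.py, so Transformers needs explicit permission to execute it):

```python
from transformers import AutoModelForCausalLM

# trust_remote_code=True tells Transformers to run the modeling code that is
# stored inside the model repo itself (e.g. modeling_openelm.py), rather than
# a class that lives in the Transformers library.
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True)
```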
Closing PR in favor of KerasNLP being able to load HF models!