
Port Gemma to TF

a8nova opened this pull request 1 year ago • 16 comments

What does this PR do?

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

a8nova avatar Mar 01 '24 22:03 a8nova

@Rocketknight1 @ArthurZucker - I want to port this model to TF. I found Keras model weights on Kaggle: https://www.kaggle.com/models/google/gemma but I can't find the modeling code for it anywhere. If this has already been ported to TF, please let me know.

@Rocketknight1 - can you please create a GPT-4 draft like you did for me in https://github.com/huggingface/transformers/pull/26870? Thank you!

a8nova avatar Mar 01 '24 22:03 a8nova

Would be nice to start fresh with a Keras 3 implementation for a new template, wdyt @Rocketknight1?

ArthurZucker avatar Mar 02 '24 04:03 ArthurZucker

@ArthurZucker we still don't have the Keras 3 backend in place - I have a partial PR, but I'm not sure the team is ready for a fourth framework in the library yet!

For now, doing this as a TF / Keras 2 port seems like the best idea. There is already a Keras 3 port of it for Keras-NLP, but we'll likely have to adapt some bits of it to make it work properly for Keras 2: https://keras.io/api/keras_nlp/models/gemma/
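
For reference, loading that Keras-NLP port looks roughly like this (a minimal sketch; the preset name is an assumption, check the Keras-NLP docs for the available presets):

import keras_nlp

# Minimal sketch of using the Keras-NLP Gemma port linked above.
# The preset name "gemma_2b_en" is an assumption, not confirmed in this thread.
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
print(gemma_lm.generate("The capital of France is", max_length=30))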

@a8nova do you have a preference between starting from the Keras-NLP port or an automatically translated port of the PyTorch code?

Rocketknight1 avatar Mar 04 '24 13:03 Rocketknight1

Thank you @ArthurZucker & @Rocketknight1. I think an auto-translated port of the PyTorch code will be better.

By the way, is it possible to share that prompt or script for the auto-translation? I am noting down all the little things GPT-4 misses during the translation; I wonder if we could keep improving or adding to the prompt until the auto-translation works in one shot :)

I also want to do a TF port of the TTS model VITS, so I will have to bother you again to auto-translate that one.

a8nova avatar Mar 04 '24 21:03 a8nova

Hi @a8nova, here's a port of the Gemma modeling code! Let me know if you need anything else.

I did this port with Claude 3, and if you want to try it yourself, here's the prompt I used:

You are a translation bot designed to translate code in the Hugging Face Transformers library from PyTorch to TensorFlow / Keras.

You will be passed a PyTorch file from the library. Your goal is to output the equivalent
TensorFlow code. If you want, you can think carefully before you start and write any thoughts or issues you have
as comments at the top of the output file. You can also add comments to the code to indicate any issues or
areas of uncertainty. Please preface your comments with "# Claude: " so that we can easily find them.

There are some guidelines you should follow when translating the code:

- When creating layers or other network modules in the `__init__()` method please pass their attribute name as the name kwarg.
- If a class inherits from `PreTrainedModel` it should instead inherit from `TFPreTrainedModel`.
- Retain any docstrings attached to methods like `forward()` and translate them, even when the method is being renamed to `call()`.
- Layer and model classes should accept **kwargs and pass these to `super().__init__()`. They should also be renamed by adding "TF" to the start of their name.
- If the class calls other classes in the same module, please add "TF" to the start of their name if required.
- TensorFlow layers do not require input shape arguments in the same way as PyTorch layers. As a result, the first
  argument to the constructor of layers like `Dense` or `Conv2D` (but not `Embedding`) can usually be removed.
- TensorFlow `Embedding` layers do not have a `padding_idx` argument. Please remove this argument from the constructor.
- Prefer the function `shape_list()`, which returns a list, over methods like `tensor.shape` or `tf.shape(tensor)`. You can get this with `from tf_utils import shape_list`.
- Keras layers do not have a `register_buffer()` method. Instead, replace `self.register_buffer(persistent=True)` with `self.add_weight(trainable=False)`. For `self.register_buffer(persistent=False)`, the best solution is usually to just compute the value in `call()`. If it's totally constant and never changed in the `forward()` method, you can store it as a `tf.constant` in the `__init__()` method instead. 
- Output classes like `BaseModelOutput` or `SequenceClassifierOutput` should have "TF" added to the start of their name.
- PyTorch Conv2D layers are always "channels_first", but TensorFlow convolutions are "channels_last". This will require inputs to be transposed.
- NumPy operations and calls to `.numpy()` must be avoided! Use TensorFlow operations instead.
- Raw layer weights that are not contained in sublayers should be created in the layer `build()` method, not in `__init__()` or `call()`. These are usually created with `self.add_weight()`, passing the PyTorch attribute name as the name kwarg. Other sublayers and modules that are not created with `self.add_weight()`, however, should be created in the `__init__()` method like they are in PyTorch.
- Make sure to retain things like docstrings from the PyTorch code. You can try to translate those too, but if you're unsure, just leave them untranslated and mark with a comment that they should be translated later.
- FlashAttention is not supported in Keras. You can remove code paths that use it.
- If you want an exact replacement for torch's `scaled_dot_product_attention()` function, you can get one via `from tf_utils import scaled_dot_product_attention`.
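
To make these guidelines concrete, here is a small illustrative sketch (a hypothetical layer, not taken from the actual Gemma port) of the kind of translation they describe:

# PyTorch original (hypothetical):
#
# class SimpleMLP(nn.Module):
#     def __init__(self, config):
#         super().__init__()
#         self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
#         self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
#
#     def forward(self, hidden_states):
#         return self.down_proj(torch.nn.functional.gelu(self.up_proj(hidden_states)))

import tensorflow as tf


class TFSimpleMLP(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        # No input-size argument for Dense; the PyTorch attribute names are passed as the name kwarg.
        self.up_proj = tf.keras.layers.Dense(config.intermediate_size, use_bias=False, name="up_proj")
        self.down_proj = tf.keras.layers.Dense(config.hidden_size, use_bias=False, name="down_proj")

    def call(self, hidden_states):
        return self.down_proj(tf.keras.activations.gelu(self.up_proj(hidden_states)))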

Claude 3's port looks a lot cleaner than GPT-4's was for IDEFICS. The main things I see are:

  • scaled_dot_product_attention isn't actually in the library until we merge your IDEFICS PR! You might have to copy that in manually for now (a rough sketch of such a helper follows this list).
  • init_weights() and post_init() calls are still there; we don't want them for TF.
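
Here's the rough sketch mentioned above: not the exact helper from the IDEFICS PR, just a plain TF stand-in for torch's scaled_dot_product_attention:

import tensorflow as tf


def scaled_dot_product_attention(query, key, value, attn_mask=None):
    # query / key / value: [batch, num_heads, seq_len, head_dim]
    scale = tf.math.rsqrt(tf.cast(tf.shape(query)[-1], query.dtype))
    scores = tf.matmul(query, key, transpose_b=True) * scale
    if attn_mask is not None:
        scores = scores + attn_mask  # additive mask (large negative values at masked positions)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, value)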

Rocketknight1 avatar Mar 05 '24 20:03 Rocketknight1

Thank you @Rocketknight1!

a8nova avatar Mar 06 '24 07:03 a8nova

I just realized that I never told it to add things like @keras_serializable or @unpack_inputs, but it knows the structure of transformers well enough that it did it anyway, lmao. cc @gante - your methods are in its brain!
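
For anyone following along, the pattern it reproduced looks roughly like this (a simplified, hypothetical sketch of the library's usual TF conventions, not the actual generated code):

import tensorflow as tf
from transformers import GemmaConfig
from transformers.modeling_tf_utils import keras_serializable, unpack_inputs


@keras_serializable
class TFGemmaMainLayer(tf.keras.layers.Layer):
    config_class = GemmaConfig  # required by @keras_serializable

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        self.config = config

    @unpack_inputs
    def call(self, input_ids=None, attention_mask=None, training=False):
        # @unpack_inputs lets callers pass inputs as a dict, list, or keyword arguments.
        ...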

Rocketknight1 avatar Mar 06 '24 13:03 Rocketknight1

I feel like this should definitely be documented somewhere! The prompt is so thorough, love it.

ariG23498 avatar Mar 12 '24 15:03 ariG23498

Hi all! I am running into compatibility issues when trying to run the tests for Gemma locally. There were too many issues on my Mac, so I am now trying to run this on a Google Colab, but I still ran into problems. Sharing the error here to see if anyone has seen it before, thanks!

This is when running the PyTorch version:

============================================== ERRORS ==============================================
____________________ ERROR collecting tests/models/gemma/test_modeling_gemma.py ____________________
ImportError while importing test module '/content/transformers/tests/models/gemma/test_modeling_gemma.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/local/lib/python3.10/dist-packages/_pytest/python.py:617: in _importtestmodule
    mod = import_path(self.path, mode=importmode, root=self.config.rootpath)
/usr/local/lib/python3.10/dist-packages/_pytest/pathlib.py:567: in import_path
    importlib.import_module(module_name)
/usr/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1050: in _gcd_import
    ???
<frozen importlib._bootstrap>:1027: in _find_and_load
    ???
<frozen importlib._bootstrap>:1006: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:688: in _load_unlocked
    ???
/usr/local/lib/python3.10/dist-packages/_pytest/assertion/rewrite.py:186: in exec_module
    exec(co, module.__dict__)
tests/models/gemma/test_modeling_gemma.py:23: in <module>
    from transformers.testing_utils import (
E   ImportError: cannot import name 'require_read_token' from 'transformers.testing_utils' (/usr/local/lib/python3.10/dist-packages/transformers/testing_utils.py)

a8nova avatar Apr 17 '24 11:04 a8nova

Hi @a8nova, my guess is there's some kind of version mismatch between the transformers that's installed and the transformers repo you're running the tests in. Try `pip install -e .` in the repo directory to make sure your versions are synced up?

Rocketknight1 avatar Apr 17 '24 12:04 Rocketknight1

Ugh, thank you @Rocketknight1, not sure how I missed that. I am able to run the tests on the Colab now. There are still issues on my Mac since `pip install -e ".[dev]"` is failing, but at least I am unblocked on Colab so I can continue working there.

Sharing the error from `pip install -e ".[dev]"` on an Intel Mac; I tried a few things from the web for the error below but am still seeing it:

ERROR: Could not find a version that satisfies the requirement decord==0.6.0; extra == "dev" (from transformers[dev]) (from versions: none)
ERROR: No matching distribution found for decord==0.6.0; extra == "dev"

a8nova avatar Apr 17 '24 12:04 a8nova

In general, I find `pip install transformers[dev]` isn't really necessary! `pip install transformers[quality]` should be sufficient for most of what you need for a PR.

Rocketknight1 avatar Apr 17 '24 13:04 Rocketknight1

Hi @Rocketknight1! A few things:

  1. Caching isn't implemented yet, so I am skipping the tests that cover it; please look at this commit (a minimal sketch of the skip pattern is below).
  2. Since caching isn't implemented yet, some tests also fail because past_key_values is None. What should I do with these tests: skip or override?
  3. I am not planning on implementing caching for the TF port, is that OK?
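
The skip pattern referenced in point 1 (a minimal sketch with placeholder class and test names, not the actual test file):

import unittest


class TFGemmaModelTest(unittest.TestCase):
    @unittest.skip(reason="Cache is not implemented in the TF port")
    def test_past_key_values_format(self):
        # Placeholder body: the real test lives in the transformers test suite.
        pass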

Thanks!

a8nova avatar Apr 26 '24 06:04 a8nova

Hi @a8nova, when you say you're not implementing caching, does that mean past_key_values just isn't implemented at all, or that we're just not implementing the PyTorch StaticCache?

Not implementing StaticCache is totally okay! But we should definitely be able to return and accept some kind of past_key_values.

Rocketknight1 avatar Apr 26 '24 15:04 Rocketknight1

Hi @Rocketknight1 - I meant I am not planning on implementing StaticCache, so past_key_values will always be None. Yes, we are definitely able to return and accept some kind of past_key_values.

a8nova avatar Apr 30 '24 12:04 a8nova

Yeah - rather than implementing StaticCache, maybe we can just return tensors with variable shapes, like the other TF models do? You can probably copy the relevant code from another TF causal LM implementation.
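
A minimal sketch of that pattern (hypothetical names, loosely following how existing TF causal LMs in the library grow their past_key_values):

import tensorflow as tf


def update_past_key_value(past_key_value, key_states, value_states):
    # key_states / value_states: [batch, num_heads, new_seq_len, head_dim]
    if past_key_value is not None:
        past_key, past_value = past_key_value
        key_states = tf.concat([past_key, key_states], axis=2)
        value_states = tf.concat([past_value, value_states], axis=2)
    # The returned tuple becomes this layer's entry in the model's past_key_values output,
    # so its sequence dimension simply grows with each generation step.
    return key_states, value_states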

Rocketknight1 avatar Apr 30 '24 13:04 Rocketknight1

Hi @Rocketknight1 - Unrelated to this, but why is it that I can't find the OpenELM source code in the GitHub repo, even though I see it on the Hub: https://huggingface.co/apple/OpenELM-270M/blob/main/modeling_openelm.py?

a8nova avatar May 19 '24 11:05 a8nova

@a8nova that happens a lot - it means it's a custom code model. Those models include their modelling source in the repo itself along with the weights, which means they don't have to wait for support in Transformers to share their model. You can load them with `AutoModel.from_pretrained("path_to_model", trust_remote_code=True)`.
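
For example (a minimal sketch; which auto class applies depends on the repo's auto_map, so AutoModelForCausalLM here is an assumption):

from transformers import AutoModelForCausalLM

# trust_remote_code=True runs the modeling_openelm.py that ships inside the model repo.
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True)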

A lot of models start out as custom code models, and then get ported into the actual Transformers repo when they gain significant usage!

Rocketknight1 avatar May 20 '24 13:05 Rocketknight1

Closing PR in favor of KerasNLP being able to load HF models!

a8nova avatar Jul 02 '24 16:07 a8nova