mediapipe Conversion of gemma-2-2b-it model to TensorFlow Lite

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

None

OS Platform and Distribution

Google Colab (Linux) Ubuntu 22.04.3 LTS

MediaPipe Tasks SDK version

0.10.14

Task name (e.g. Image classification, Gesture recognition etc.)

LLM Inference

Programming Language and version (e.g. C++, Python, Java)

Python

Describe the actual behavior

The converter.convert_checkpoint method throws an AssertionError with no message

Describe the expected behaviour

The gemma-2-2b-it model should be converted to a TFLite model (for CPU)

Standalone code/steps you may have used to try to get what you need

from huggingface_hub import hf_hub_download
import os
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

REPO_ID = "google/gemma-2-2b-it"
FILENAMES = ["tokenizer.json", "tokenizer_config.json", "model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
os.environ['HF_TOKEN'] = "<token>"
for filename in FILENAMES:
  hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="./gemma-2-2b-it")

config = converter.ConversionConfig(
    input_ckpt="/content/gemma-2-2b-it", 
    ckpt_format='safetensors', 
    model_type='GEMMA_2B', 
    backend="cpu", 
    output_dir="/content/intermediate/gemma-2-2b-it/", 
    combine_file_only=False, 
    vocab_model_file="/content/gemma-2-2b-it", 
    output_tflite_file="/content/converted_models/gemma-2-2b-it-cpu"
)
converter.convert_checkpoint(config)

Other info / Complete Logs

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-ae16540c09c6> in <cell line: 14>()
     12     output_tflite_file="/content/converted_models/gemma-2-2b-it-cpu"
     13 )
---> 14 converter.convert_checkpoint(config)

3 frames
/usr/local/lib/python3.10/dist-packages/mediapipe/tasks/python/genai/converter/quantization_util.py in quantize_tensor(var, axis, factor, sym, number_bits, use_fp, add_scale_eps, optimization_on_bound, p_value, per_channel, block_size)
    352   """
    353   # TODO: support jnp.float8_e5m2
--> 354   assert number_bits == 8 or number_bits == 4 , f"Number bits {number_bits}"
    355   jnp_var = jnp.asarray(var)
    356   # When using sub-channel, the contracting dim is split into a sub-channel

Aug 15 '24 09:08 shubham0204

Hi @shubham0204,

Could you please confirm that you are using the example Colab provided here for model conversion and learning about the required arguments for the converter?

Thank you!!

Aug 16 '24 08:08 kuaashish

Yes @kuaashish, I am using the same notebook. Here are the additional blocks of code I added to download Gemma 2 and convert it to TFLite:

from huggingface_hub import hf_hub_download
import os

REPO_ID = "google/gemma-2-2b-it"
FILENAMES = ["tokenizer.json", "tokenizer_config.json", "model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
os.environ['HF_TOKEN'] = "<token>"
for filename in FILENAMES:
  hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="./gemma-2-2b-it")

import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    input_ckpt="/content/gemma-2-2b-it",
    ckpt_format='safetensors',
    model_type='GEMMA_2B',
    backend="cpu",
    output_dir="/content/intermediate/gemma-2-2b-it/",
    combine_file_only=False,
    vocab_model_file="/content/gemma-2-2b-it",
    output_tflite_file="/content/converted_models/gemma-2-2b-it-cpu"
)
converter.convert_checkpoint(config)

Aug 16 '24 11:08 shubham0204

Adding the Gemma 2 layer norm names to layer_norms in the LayerType class in /site-packages/mediapipe/tasks/python/genai/converter/safetensors_converter.py gets past the assert, but the output_tflite_file looks wrong because its size is not reduced.


class LayerType(enum.Enum):
  """Enum for layer type."""

  NONE = 0
  ATTENTION = 1  # Layer is part of the attention module.
  FEEDFORWARD = 2  # Layer is part of the feedforward module in the Transformer.
  EMBEDDING = 3  # Layer is the embedding lookup or final projection layer.
  LAYER_NORM = (
      4  # Layer is layer normalization before and after attention layer.
  )
  LORA = 5  # Layer is LoRA weights augmented on the base model layers.

  @classmethod
  def get_layer_type(cls, layer_name: str):
    """Gets the layer type of the given layer name."""
    ffn_layers = [
        "mlp",
    ]
    attn_layers = [
        "self_attn",
    ]
    emb_layers = [
        "embed_tokens",
        "lm_head",
    ]
    layer_norms = [
        "input_layernorm",
        "post_attention_layernorm",
        "post_feedforward_layernorm",
        "pre_feedforward_layernorm",
        "final_layernorm",
        "model.norm.weight",
    ]
    lora_layers = ["lora"]
    if any(sub_name in layer_name for sub_name in lora_layers):
      return LayerType.LORA
    if any(sub_name in layer_name for sub_name in attn_layers):
      return LayerType.ATTENTION
    if any(sub_name in layer_name for sub_name in ffn_layers):
      return LayerType.FEEDFORWARD
    if any(sub_name in layer_name for sub_name in emb_layers):
      return LayerType.EMBEDDING
    if any(sub_name in layer_name for sub_name in layer_norms):
      return LayerType.LAYER_NORM
    else:
      return LayerType.NONE
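
To illustrate why the extra names matter, here is a minimal standalone sketch of the same substring-matching idea. It does not import MediaPipe; `classify`, `BASE_LAYER_NORMS` and `GEMMA2_LAYER_NORMS` are hypothetical names, and the "before" list is an assumption about what the stock file contained, not a copy of it. The tensor name follows the Hugging Face Gemma 2 checkpoint naming.

```python
import enum


class LayerType(enum.Enum):
    NONE = 0
    ATTENTION = 1
    FEEDFORWARD = 2
    EMBEDDING = 3
    LAYER_NORM = 4
    LORA = 5


# Assumption: the unmodified list lacks the two Gemma 2 feedforward norms.
BASE_LAYER_NORMS = [
    "input_layernorm",
    "post_attention_layernorm",
    "final_layernorm",
    "model.norm.weight",
]
GEMMA2_LAYER_NORMS = BASE_LAYER_NORMS + [
    "post_feedforward_layernorm",
    "pre_feedforward_layernorm",
]


def classify(layer_name, layer_norms):
    """Classify a tensor name by substring match, in the same order as above."""
    if "lora" in layer_name:
        return LayerType.LORA
    if "self_attn" in layer_name:
        return LayerType.ATTENTION
    if "mlp" in layer_name:
        return LayerType.FEEDFORWARD
    if any(s in layer_name for s in ("embed_tokens", "lm_head")):
        return LayerType.EMBEDDING
    if any(s in layer_name for s in layer_norms):
        return LayerType.LAYER_NORM
    return LayerType.NONE


name = "model.layers.0.pre_feedforward_layernorm.weight"
before = classify(name, BASE_LAYER_NORMS)
after = classify(name, GEMMA2_LAYER_NORMS)
print(before, after)
```

With the shorter list the Gemma 2 norm falls through to NONE; with the extended list it is treated as a layer norm. Whether that alone is enough for a correct conversion is a separate question.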

Aug 20 '24 08:08 Woody0414

Thanks @Woody0414. I modified the MediaPipe source file, but then received the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-913a430439f8> in <cell line: 14>()
     12     output_tflite_file="/content/converted_models/gemma-2-2b-it-cpu"
     13 )
---> 14 converter.convert_checkpoint(config)

1 frames
/usr/local/lib/python3.10/dist-packages/mediapipe/tasks/python/genai/converter/llm_converter.py in combined_weight_bins_to_tflite(model_type, backend, weight_path, output_tflite_file, vocab_model_file, lora_rank, lora_weight_path, lora_output_tflite_file)
    180     if lora_rank is not None:
    181       logging.fatal('LoRA is not supported for CPU backend.')
--> 182     model_ckpt_util.GenerateCpuTfLite(
    183         model_type,
    184         weight_path,

RuntimeError: NOT_FOUND: The path does not exist: /content/intermediate/gemma-2-2b-it/params.lm.transformer.x_layers_0.ff_layer.pre_layer_norm.scale_quantized_scale

The params.lm.transformer.x_layers_0.ff_layer.pre_layer_norm.scale file exists, but not params.lm.transformer.x_layers_0.ff_layer.pre_layer_norm.scale_quantized_scale
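
A rough way to see which intermediate files lack a scale companion is to list the output directory. This is only a sketch: `missing_scale_files` is a hypothetical helper, the `_quantized_scale` suffix is taken from the error message above, and a missing companion is not necessarily an error, since unquantized tensors may legitimately have none.

```python
import os
import tempfile

SCALE_SUFFIX = "_quantized_scale"  # suffix as it appears in the error message


def missing_scale_files(weight_dir):
    """Return tensor files in weight_dir that have no `_quantized_scale` sibling."""
    names = set(os.listdir(weight_dir))
    tensors = sorted(n for n in names if not n.endswith(SCALE_SUFFIX))
    return [n for n in tensors if n + SCALE_SUFFIX not in names]


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for fname in [
        "a.w",
        "a.w_quantized_scale",
        "params.lm.transformer.x_layers_0.ff_layer.pre_layer_norm.scale",
    ]:
        open(os.path.join(demo_dir, fname), "wb").close()
    demo_missing = missing_scale_files(demo_dir)

print(demo_missing)
```

On the real intermediate directory the call would be `missing_scale_files("/content/intermediate/gemma-2-2b-it/")`.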

Aug 21 '24 01:08 shubham0204

Hi @shubham0204,

It appears you are trying to convert the recently released Gemma-2-2b model. Our initial testing has focused on the Gemma 2b model, and you can find more information in our documentation here. Currently, this model cannot be converted into a TFLite format, though support for this is on our roadmap. However, we cannot provide a specific timeline for availability at this moment.

Thank you!!

Aug 21 '24 06:08 kuaashish

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

Aug 29 '24 01:08 github-actions[bot]

@kuaashish will I get an update on this issue once support for converting Gemma 2 models is available?

Sep 02 '24 03:09 shubham0204

I have encountered the same issue while following the example here. The error log:

Traceback (most recent call last):
  File "/home/franzkafka/Desktop/mediapipe/convert.py", line 15, in <module>
    converter.convert_checkpoint(config)
  File "/home/franzkafka/.local/lib/python3.10/site-packages/mediapipe/tasks/python/genai/converter/llm_converter.py", line 323, in convert_checkpoint
    maybe_quantize_and_write_tensors_to_bins(loader, config)
  File "/home/franzkafka/.local/lib/python3.10/site-packages/mediapipe/tasks/python/genai/converter/llm_converter.py", line 284, in maybe_quantize_and_write_tensors_to_bins
    quantized_tensors = quantize_by_actions(
  File "/home/franzkafka/.local/lib/python3.10/site-packages/mediapipe/tasks/python/genai/converter/llm_converter.py", line 169, in quantize_by_actions
    target_var, scale = quantization_util.quantize_tensor(
  File "/home/franzkafka/.local/lib/python3.10/site-packages/mediapipe/tasks/python/genai/converter/quantization_util.py", line 354, in quantize_tensor
    assert number_bits == 8 or number_bits == 4
AssertionError

Sep 04 '24 11:09 FranzKafkaYu

Hi @kuaashish, the MediaPipe docs say that the MediaPipe LLM Inference API already supports Gemma 2, but I can't find a Gemma 2 model in TFLite format on Kaggle. How can I use the MediaPipe LLM Inference API to load Gemma 2 models?

Sep 05 '24 01:09 FranzKafkaYu

Hi @FranzKafkaYu,

Could you please create a new issue with a detailed description of the support you need? This will help us and the community identify and address the problem effectively with a relevant issue title.

Thank you!!

Sep 05 '24 03:09 kuaashish

Issue created: https://github.com/google-ai-edge/mediapipe/issues/5610

Sep 05 '24 07:09 FranzKafkaYu

Hi,

We updated our docs to provide info on using Gemma2-2B here. When we initially supported Gemma2-2B, the only pathway to using it on-device was converting the model through ai_edge_torch. It still requires a system with a lot of memory to do the conversion+quantization, so we decided to just directly host the necessary file on Kaggle (at this URL). You can download the models through this interface:

For Gemma2-2b, we support a CPU version and a GPU version. Both versions work in the LLM Inference API. The CPU version is a classic "TF Lite" file and can be used in traditional ways, as shown in an example here
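
One quick sanity check on a downloaded file is to look for the TensorFlow Lite flatbuffer file identifier (`TFL3` at bytes 4 to 8). This is a minimal sketch, not part of any MediaPipe API; `looks_like_tflite` is a hypothetical helper, and whether a given Kaggle download passes the check is an assumption to verify, not something confirmed in this thread.

```python
import os
import tempfile

TFLITE_FILE_IDENTIFIER = b"TFL3"  # flatbuffer file identifier used by TFLite models


def looks_like_tflite(path):
    """Return True if the file header carries the TFLite flatbuffer identifier."""
    with open(path, "rb") as f:
        header = f.read(8)
    return len(header) == 8 and header[4:8] == TFLITE_FILE_IDENTIFIER


# Demo with two throwaway files: a fake TFLite header and unrelated bytes.
with tempfile.TemporaryDirectory() as demo_dir:
    fake_model = os.path.join(demo_dir, "fake.tflite")
    other_file = os.path.join(demo_dir, "other.bin")
    with open(fake_model, "wb") as f:
        f.write(b"\x1c\x00\x00\x00" + TFLITE_FILE_IDENTIFIER + b"\x00" * 16)
    with open(other_file, "wb") as f:
        f.write(b"PK\x03\x04 not a flatbuffer")
    fake_ok = looks_like_tflite(fake_model)
    other_ok = looks_like_tflite(other_file)

print(fake_ok, other_ok)
```

The check only inspects the header; it says nothing about whether the model is complete or loadable.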

The LLM Inference API (doc link above) is a full-featured offering that you can call directly from an Android app, as shown in our samples here.

Nov 08 '24 18:11 talumbau

Hi @shubham0204,

Could you please review the above and confirm if we can close the status and mark it resolved internally?

Thank you!!

Nov 20 '24 08:11 kuaashish

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

Nov 28 '24 02:11 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for past 7 days.

Dec 05 '24 02:12 github-actions[bot]
