Add preprocessing utils for Qwen3-Omni

Open hengtaoguo opened this issue 1 month ago • 4 comments

Description

  • Add image/video/audio preprocessing utils for Qwen3-Omni in MaxText.multimodal.qwen3_omni_preprocessor.preprocess_mm_data_qwen3_omni(), returning dataclass Qwen3OmniPreprocessorOutput containing all preprocessed data (pixel_values, pixel_grid_thw, video_values, video_grid_thw, video_second_per_grid, audio_values, audio_mask).
  • Add unit test comparing MaxText implementation with Qwen3-Omni's processor on HuggingFace.
  • [WIP] Refactor [multimodal_utils.py]:
    • MaxText.multimodal.utils: Commonly used basic functions such as image loading and normalization.
    • MaxText.multimodal.{MODEL}_preprocessor.py: Model-specific preprocessing utils.
    • MaxText.multimodal.preprocessor.py: Centralized function preprocess_mm_data() routes to model-specific preprocessing logic based on the model name.
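
A minimal sketch of how the output dataclass and the centralized routing described above could fit together. The field names come from the description; the field types, the `config` dict argument, the `_PREPROCESSORS` registry, the prefix matching, and the example model name are all hypothetical and are not taken from the MaxText code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Qwen3OmniPreprocessorOutput:
  """Container for preprocessed inputs; field names follow the PR description."""

  # Types are assumptions: each field would typically hold a numpy array.
  pixel_values: Optional[Any] = None
  pixel_grid_thw: Optional[Any] = None
  video_values: Optional[Any] = None
  video_grid_thw: Optional[Any] = None
  video_second_per_grid: Optional[Any] = None
  audio_values: Optional[Any] = None
  audio_mask: Optional[Any] = None


def preprocess_mm_data_qwen3_omni(config: Dict[str, Any]) -> Qwen3OmniPreprocessorOutput:
  """Stand-in for the model-specific preprocessor (real logic omitted)."""
  return Qwen3OmniPreprocessorOutput()


# Hypothetical registry mapping model-name prefixes to preprocessors.
_PREPROCESSORS: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "qwen3-omni": preprocess_mm_data_qwen3_omni,
}


def preprocess_mm_data(config: Dict[str, Any]) -> Any:
  """Route to a model-specific preprocessor based on config["model_name"]."""
  model_name = config["model_name"]
  for prefix, fn in _PREPROCESSORS.items():
    if model_name.startswith(prefix):
      return fn(config)
  raise ValueError(f"No multimodal preprocessor registered for {model_name!r}")
```

With a registry like this, supporting another model would mean adding one entry rather than editing the routing function.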

Tests

Passing unit tests for MaxText preprocess_mm_data_qwen3_omni vs HuggingFace Qwen3OmniMoeProcessor:

python -m unittest tests.check_qwen3_embedding_vs_reference.TextQwen3OmniPreprocessing
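
For illustration only, a parity check of this kind could compare the two outputs field by field. The helper below is a sketch under assumptions: `assert_outputs_close` is a hypothetical name, both outputs are assumed to have been converted to dicts of arrays already, and the tolerances are placeholders rather than the values used in the actual test.

```python
import numpy as np


def assert_outputs_close(ours, reference, rtol=1e-5, atol=1e-5):
  """Check that two dicts of arrays have the same keys, shapes, and values."""
  assert set(ours) == set(reference), (
      f"field mismatch: {sorted(ours)} vs {sorted(reference)}"
  )
  for name in sorted(ours):
    a = np.asarray(ours[name])
    b = np.asarray(reference[name])
    assert a.shape == b.shape, f"{name}: shape {a.shape} vs {b.shape}"
    # Raises AssertionError with the field name if values differ.
    np.testing.assert_allclose(a, b, rtol=rtol, atol=atol, err_msg=f"field {name}")
```

Checking shapes before values keeps failure messages readable when the grid dimensions disagree.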

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • [x] I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • [x] I have necessary comments in my code, particularly in hard-to-understand areas.
  • [x] I have run end-to-end tests and provided workload links above if applicable.
  • [x] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

hengtaoguo avatar Nov 06 '25 16:11 hengtaoguo

Is this functionality implemented on CPU in numpy in the torch variant? If so, is there a reason not to reuse it?

eitanporat avatar Nov 19 '25 09:11 eitanporat

Could you add the new requirements (decord and librosa) to pyproject.toml?

eitanporat avatar Nov 19 '25 16:11 eitanporat

🤖 Hi @hengtaoguo, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions[bot] avatar Nov 19 '25 16:11 github-actions[bot]

> Is this functionality implemented on CPU in numpy in the torch variant? If so, is there a reason not to reuse it?

This has been a long-standing constraint: we intentionally exclude torch from our dependencies, so we cannot use torch resize functions and need to reimplement everything in numpy/jnp.

hengtaoguo avatar Nov 19 '25 17:11 hengtaoguo
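
To illustrate what a torch-free reimplementation of the kind discussed in this thread can look like, here is a small numpy-only sketch of image resizing and normalization. It is not the code from this PR: bilinear interpolation with half-pixel centers is used here purely for brevity and may differ from the interpolation the HuggingFace processor or MaxText uses, and the function names are hypothetical.

```python
import numpy as np


def resize_bilinear(image: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
  """Bilinear resize of an (H, W, C) array using half-pixel centers."""
  in_h, in_w = image.shape[:2]
  # Map output pixel centers back to input coordinates, clamped to the image.
  ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
  xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
  y0 = np.floor(ys).astype(int)
  x0 = np.floor(xs).astype(int)
  y1 = np.minimum(y0 + 1, in_h - 1)
  x1 = np.minimum(x0 + 1, in_w - 1)
  # Fractional offsets, shaped to broadcast over (out_h, out_w, C).
  wy = (ys - y0)[:, None, None]
  wx = (xs - x0)[None, :, None]
  img = image.astype(np.float32)
  top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
  bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
  return top * (1 - wy) + bottom * wy


def normalize(image: np.ndarray, mean, std) -> np.ndarray:
  """Scale 0..255 pixel values to 0..1, then apply per-channel mean/std."""
  mean = np.asarray(mean, dtype=np.float32)
  std = np.asarray(std, dtype=np.float32)
  return (image.astype(np.float32) / 255.0 - mean) / std
```

Because both functions use only array indexing and broadcasting, the same logic should port to jnp with few changes.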