Add preprocessing utils for Qwen3-Omni
Description
- Add image/video/audio preprocessing utils for Qwen3-Omni in `MaxText.multimodal.qwen3_omni_preprocessor.preprocess_mm_data_qwen3_omni()`, returning a dataclass `Qwen3OmniPreprocessorOutput` containing all preprocessed data (`pixel_values`, `pixel_grid_thw`, `video_values`, `video_grid_thw`, `video_second_per_grid`, `audio_values`, `audio_mask`).
- Add a unit test comparing the MaxText implementation against Qwen3-Omni's processor on HuggingFace.
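For illustration, the output dataclass described above could look roughly like the following sketch. The field names come from the description; the types (optional NumPy arrays) and docstrings are assumptions, not the actual MaxText definition:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Qwen3OmniPreprocessorOutput:
  """Container for preprocessed multimodal inputs (sketch; types assumed)."""

  pixel_values: Optional[np.ndarray] = None        # preprocessed image patches
  pixel_grid_thw: Optional[np.ndarray] = None      # (T, H, W) grid sizes per image
  video_values: Optional[np.ndarray] = None        # preprocessed video frames
  video_grid_thw: Optional[np.ndarray] = None      # (T, H, W) grid sizes per video
  video_second_per_grid: Optional[np.ndarray] = None  # temporal scale per video
  audio_values: Optional[np.ndarray] = None        # preprocessed audio features
  audio_mask: Optional[np.ndarray] = None          # valid-frame mask for audio
```

With defaults of `None`, callers that only pass one modality (e.g. images) leave the other fields unset.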
- [WIP] Refactor `multimodal_utils.py`:
  - `MaxText.multimodal.utils`: commonly used basic functions such as image loading and normalization.
  - `MaxText.multimodal.{MODEL}_preprocessor.py`: model-specific preprocessing utils.
  - `MaxText.multimodal.preprocessor.py`: a centralized function `preprocess_mm_data()` that routes to model-specific preprocessing logic based on the model name.
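The routing layer described above could be sketched with a simple registry; this is a hypothetical shape, not the actual MaxText wiring, and the model key `"qwen3-omni"` is assumed:

```python
# Hypothetical registry-based router for preprocess_mm_data().
_PREPROCESSORS = {}


def register_preprocessor(model_name):
  """Decorator registering a model-specific preprocessing function."""
  def wrap(fn):
    _PREPROCESSORS[model_name] = fn
    return fn
  return wrap


@register_preprocessor("qwen3-omni")
def preprocess_mm_data_qwen3_omni(data):
  # Placeholder body; the real function returns Qwen3OmniPreprocessorOutput.
  return {"model": "qwen3-omni", **data}


def preprocess_mm_data(model_name, data):
  """Centralized entry point: dispatch to the model-specific preprocessor."""
  try:
    return _PREPROCESSORS[model_name](data)
  except KeyError:
    raise ValueError(f"No multimodal preprocessor registered for {model_name!r}")
```

New models then only need to register themselves; callers never import model-specific modules directly.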
Tests
Passing unit tests for MaxText preprocess_mm_data_qwen3_omni vs HuggingFace Qwen3OmniMoeProcessor:
python -m unittest tests.check_qwen3_embedding_vs_reference.TextQwen3OmniPreprocessing
Checklist
Before submitting this PR, please make sure (put X in square brackets):
- [x] I have performed a self-review of my code. For an optional AI review, add the `gemini-review` label.
- [x] I have added necessary comments in my code, particularly in hard-to-understand areas.
- [x] I have run end-to-end tests and provided workload links above if applicable.
- [x] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.
is the functionality implemented on cpu in numpy in the torch variant. if so, is there a reason not to want to reuse it?
could you add the new requirements to the pyproject toml (decord and librosa)?
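For reference, the new dependencies could be declared roughly as below; the section layout and the absence of version pins are assumptions, so match the repo's existing dependency table:

```toml
[project]
dependencies = [
  # ... existing dependencies ...
  "decord",   # video frame decoding
  "librosa",  # audio loading and resampling
]
```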
> is the functionality implemented on cpu in numpy in the torch variant. if so, is there a reason not to want to reuse it?

This has been a long-standing constraint: we intentionally exclude torch from our dependencies. So we cannot reuse torch's resize functions and need to reimplement everything in numpy/jnp.
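As an illustration of the torch-free approach, a bilinear image resize can be written in pure NumPy. This is a generic sketch using the half-pixel-center sampling convention, not the actual MaxText implementation, which may differ in convention and edge handling:

```python
import numpy as np


def resize_bilinear(img, out_h, out_w):
  """Bilinear resize of an (H, W, C) float array in pure NumPy (sketch)."""
  in_h, in_w = img.shape[:2]
  # Source coordinates for each output pixel (half-pixel-center convention).
  ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
  xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
  # Integer neighbors and interpolation weights.
  y0 = np.floor(ys).astype(int)
  x0 = np.floor(xs).astype(int)
  y1 = np.minimum(y0 + 1, in_h - 1)
  x1 = np.minimum(x0 + 1, in_w - 1)
  wy = (ys - y0)[:, None, None]   # (out_h, 1, 1)
  wx = (xs - x0)[None, :, None]   # (1, out_w, 1)
  # Interpolate horizontally on the two neighboring rows, then vertically.
  top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
  bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
  return top * (1 - wy) + bot * wy
```

The same code runs unchanged under `jax.numpy` by swapping the import, which is the usual payoff of staying in the numpy/jnp subset.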