ms-swift
Qwen2.5 VL dataset sampler based on modality length
Qwen2.5 VL accepts images of varying sizes, so when training on relatively large images the number of image tokens differs a lot from item to item, and therefore from batch to batch. The default sampling logic does not take the multimodal length of each item into account, so some batches end up far longer than others and cause OOM.
Is there any way to support sampling based on modality length during training?
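
To make the request concrete, here is a minimal sketch of what I have in mind. It is not ms-swift code: `estimate_length` and `TokenBudgetBatchSampler` are hypothetical names, and the estimate of one image token per 28x28 pixel block is an assumption about the Qwen2.5 VL processor that ignores its resizing and pixel limits. The idea is to sort items by combined text and image length, then fill each batch until its padded size would exceed a token budget.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def estimate_length(text_tokens: int,
                    image_sizes: Sequence[Tuple[int, int]],
                    pixels_per_token: int = 28 * 28) -> int:
    """Rough item length: text tokens plus estimated image tokens.

    ASSUMPTION: one image token per 28x28 pixel block. The real processor
    also resizes images, so treat this as an approximation only.
    """
    image_tokens = sum((h * w) // pixels_per_token for h, w in image_sizes)
    return text_tokens + image_tokens


class TokenBudgetBatchSampler:
    """Hypothetical batch sampler that yields lists of dataset indices.

    Items are sorted by length and packed so that
    (longest item in batch) * (batch size) stays within max_tokens_per_batch.
    An item longer than the budget still gets a batch of its own.
    """

    def __init__(self, lengths: Sequence[int], max_tokens_per_batch: int,
                 shuffle: bool = True, seed: int = 0) -> None:
        self.lengths = list(lengths)
        self.max_tokens = max_tokens_per_batch
        self.shuffle = shuffle
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Changes the batch order between epochs while staying reproducible.
        self.epoch = epoch

    def _build(self) -> List[List[int]]:
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches: List[List[int]] = []
        current: List[int] = []
        current_max = 0
        for idx in order:
            new_max = max(current_max, self.lengths[idx])
            # Padded cost if idx joined the current batch.
            if current and new_max * (len(current) + 1) > self.max_tokens:
                batches.append(current)
                current, new_max = [], self.lengths[idx]
            current.append(idx)
            current_max = new_max
        if current:
            batches.append(current)
        if self.shuffle:
            # Shuffle whole batches so similar lengths stay grouped together.
            random.Random(self.seed + self.epoch).shuffle(batches)
        return batches

    def __iter__(self) -> Iterator[List[int]]:
        return iter(self._build())

    def __len__(self) -> int:
        return len(self._build())
```

With plain PyTorch, an object like this can be passed as `batch_sampler=` to `torch.utils.data.DataLoader`. I do not know whether ms-swift exposes a hook for a custom batch sampler, which is part of what I am asking. Sorting by length reduces padding but also makes batch sizes vary, so gradient scaling may need attention.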