The cutoff of multimodal input sequences
Reminder
- [x] I have read the above rules and searched the existing issues.
Description
As mentioned in https://github.com/hiyouga/LLaMA-Factory/issues/6844#issuecomment-2644439667, the cutoff of a multimodal sequence should not remove the visual (non-text) tokens.
But the current cutoff logic truncates the query and the response dynamically:
https://github.com/hiyouga/LLaMA-Factory/blob/0fb44cb3a5499c8da79e73004adc9d16f792b4b3/src/llamafactory/data/processors/supervised.py#L60-L63
For data like "{query}:VVVVVVVQQQ{response}:RRRRRRRRRR" (V = vision token, Q = query text, R = response text), the result may be "{query}:VVVVVRRRRRRRR", with the vision token run cut in the middle.
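For concreteness, a minimal runnable sketch of the failure mode; the infer_seqlen below is a simplified stand-in for the library's proportional truncation, not its exact implementation:

```python
def infer_seqlen(source_len: int, target_len: int, cutoff_len: int) -> tuple[int, int]:
    # Simplified stand-in: split the cutoff budget between source and target
    # proportionally to their original lengths.
    if source_len + target_len <= cutoff_len:
        return source_len, target_len
    new_source_len = cutoff_len * source_len // (source_len + target_len)
    return new_source_len, cutoff_len - new_source_len

source_ids = list("VVVVVVVQQQ")  # 7 vision tokens + 3 query text tokens
target_ids = list("RRRRRRRRRR")  # 10 response tokens
src_len, tgt_len = infer_seqlen(len(source_ids), len(target_ids), cutoff_len=13)
print("".join(source_ids[:src_len]) + "".join(target_ids[:tgt_len]))
# VVVVVVRRRRRRR -- the vision token run is cut mid-sequence, so the number of
# image placeholder tokens no longer matches the encoded image features
```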
So should we special-case the cutoff for multimodal sequences to preserve as many multimodal tokens as possible? Something like:
```python
source_len, target_len = infer_seqlen(len(source_ids), len(target_ids), cutoff_len - total_length)
# for multimodal inputs, keep the source (which holds the multimodal tokens)
# and only truncate the response
if images or videos or audios:
    seqlen = source_len + target_len
    source_len = min(len(source_ids), seqlen)
    target_len = min(len(target_ids), seqlen - source_len)

source_ids = source_ids[:source_len]
target_ids = target_ids[:target_len]
```
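And a quick sanity check that the proposed logic keeps the multimodal tokens intact on the same toy example (truncate_multimodal is a hypothetical helper wrapping the snippet above):

```python
def truncate_multimodal(source_ids: list, target_ids: list, cutoff_len: int) -> tuple[list, list]:
    # Keep the full source (including vision tokens) whenever it fits the
    # budget, and truncate the response first.
    seqlen = min(len(source_ids) + len(target_ids), cutoff_len)
    source_len = min(len(source_ids), seqlen)
    target_len = min(len(target_ids), seqlen - source_len)
    return source_ids[:source_len], target_ids[:target_len]

src, tgt = truncate_multimodal(list("VVVVVVVQQQ"), list("RRRRRRRRRR"), cutoff_len=13)
print("".join(src) + "".join(tgt))  # VVVVVVVQQQRRR -- vision tokens preserved
```

One edge case remains: if the source alone is already longer than cutoff_len, vision tokens would still be cut; such samples may need to be dropped (or raise an error) rather than truncated.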
Pull Request
No response