The cutoff of multimodal input sequences
Reminder
- [x] I have read the above rules and searched the existing issues.
Description
As mentioned in https://github.com/hiyouga/LLaMA-Factory/issues/6844#issuecomment-2644439667, the cutoff of a multimodal sequence should not remove the visual (non-text) tokens.
But the current cutoff logic truncates the query and the response dynamically:
https://github.com/hiyouga/LLaMA-Factory/blob/0fb44cb3a5499c8da79e73004adc9d16f792b4b3/src/llamafactory/data/processors/supervised.py#L60-L63
For data like "{query}:VVVVVVVQQQ{response}:RRRRRRRRRR" (V = vision token, Q = query text, R = response text), the result may be "{query}:VVVVVRRRRRRRR", with the vision token run cut in the middle.
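For concreteness, a minimal runnable sketch of the failure mode; the infer_seqlen below is a simplified stand-in for the library's proportional truncation, not its exact implementation:

```python
def infer_seqlen(source_len: int, target_len: int, cutoff_len: int) -> tuple[int, int]:
    # Simplified stand-in: split the cutoff budget between source and target
    # proportionally to their original lengths.
    if source_len + target_len <= cutoff_len:
        return source_len, target_len
    new_source_len = cutoff_len * source_len // (source_len + target_len)
    return new_source_len, cutoff_len - new_source_len

source_ids = list("VVVVVVVQQQ")  # 7 vision tokens + 3 query text tokens
target_ids = list("RRRRRRRRRR")  # 10 response tokens
src_len, tgt_len = infer_seqlen(len(source_ids), len(target_ids), cutoff_len=13)
print("".join(source_ids[:src_len]) + "".join(target_ids[:tgt_len]))
# VVVVVVRRRRRRR -- the vision token run is cut mid-sequence, so the number of
# image placeholder tokens no longer matches the encoded image features
```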
So should we special-case the cutoff for multimodal sequences to preserve as many multimodal tokens as possible? Something like:
```python
source_len, target_len = infer_seqlen(len(source_ids), len(target_ids), cutoff_len - total_length)
# for multimodal inputs, keep the source (which holds the multimodal tokens)
# and only truncate the response
if images or videos or audios:
    seqlen = source_len + target_len
    source_len = min(len(source_ids), seqlen)
    target_len = min(len(target_ids), seqlen - source_len)

source_ids = source_ids[:source_len]
target_ids = target_ids[:target_len]
```
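And a quick sanity check that the proposed logic keeps the multimodal tokens intact on the same toy example (truncate_multimodal is a hypothetical helper wrapping the snippet above):

```python
def truncate_multimodal(source_ids: list, target_ids: list, cutoff_len: int) -> tuple[list, list]:
    # Keep the full source (including vision tokens) whenever it fits the
    # budget, and truncate the response first.
    seqlen = min(len(source_ids) + len(target_ids), cutoff_len)
    source_len = min(len(source_ids), seqlen)
    target_len = min(len(target_ids), seqlen - source_len)
    return source_ids[:source_len], target_ids[:target_len]

src, tgt = truncate_multimodal(list("VVVVVVVQQQ"), list("RRRRRRRRRR"), cutoff_len=13)
print("".join(src) + "".join(tgt))  # VVVVVVVQQQRRR -- vision tokens preserved
```

One edge case remains: if the source alone is already longer than cutoff_len, vision tokens would still be cut; such samples may need to be dropped (or raise an error) rather than truncated.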
Pull Request
No response