
Thinking mode not working correctly

VesVlad opened this issue 4 months ago · 2 comments

Hi! Thank you for the work done! Maybe I'm doing something wrong, but when I use R1_SYSTEM_PROMPT, the model does not produce thinking content. There is no content-separation method inside the chat, so I expected to see the thinking content and the plain content separately in the response. I used the cascade-RL-trained models (InternVL3.5-241B-A28B and InternVL3.5-30B-A3B; I also tried the MPO-only training versions) on images and videos. Maybe thinking is not mandatory, as it was in Qwen 3's think mode; if so, could you tell me which prompt reliably causes the model to think? I will be grateful for any answer.

Using R1_SYSTEM_PROMPT:

[image]

After that, an example in which thinking content was expected but did not appear:

[image]

VesVlad avatar Aug 27 '25 12:08 VesVlad

Thank you for your interest in our paper. This issue may arise because our model has not been trained on captioning tasks in the thinking mode. To ensure that thinking mode is reliably triggered, you may consider replacing the following code segment in modeling_internvl_chat.py.

Specifically, replace

        history = [] if history is None else history
        for (old_question, old_answer) in history:
            template.append_message(template.roles[0], old_question)
            template.append_message(template.roles[1], old_answer)
        template.append_message(template.roles[0], question)
        template.append_message(template.roles[1], None)
        query = template.get_prompt()

with

        history = [] if history is None else history
        for (old_question, old_answer) in history:
            template.append_message(template.roles[0], old_question)
            template.append_message(template.roles[1], old_answer)
        template.append_message(template.roles[0], question)
        # Prefill the assistant turn with '<think>' so generation starts
        # inside the thinking block instead of letting the model decide
        # whether to think at all.
        template.append_message(template.roles[1], '<think>')
        query = template.get_prompt()

        # get_prompt() appends the separator after the (now non-empty)
        # assistant message; strip it so the model continues right after
        # the '<think>' prefill.
        if query.endswith(template.sep):
            query = query[:-len(template.sep)]
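With this prefill, the generated text typically contains the reasoning followed by a closing tag and then the answer. Since chat() returns a single raw string, a small post-processing helper can separate the two parts. This is a minimal sketch, assuming the model closes its reasoning with a `</think>` tag; the helper name `split_thinking` is my own, not part of the InternVL codebase:

```python
def split_thinking(response: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty if no </think> tag is found."""
    # With the '<think>' prefill, the opening tag is part of the prompt,
    # so the response usually starts directly with the reasoning text.
    # Drop a leading '<think>' if the model echoed it anyway.
    text = response.removeprefix('<think>')
    thinking, sep, answer = text.partition('</think>')
    if not sep:
        # No closing tag: the model skipped thinking; treat the whole
        # response as the answer.
        return '', response.strip()
    return thinking.strip(), answer.strip()
```

For example, `split_thinking('Let me think.</think>The cat is black.')` yields the pair `('Let me think.', 'The cat is black.')`.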

Weiyun1025 avatar Aug 29 '25 07:08 Weiyun1025


I encounter the same issue when doing multi-image inference. Is the problem caused by the same reason?

Celtyee avatar Sep 29 '25 06:09 Celtyee