ART Gemma 3 fix

As noted in the main README, Gemma 3 models are not yet supported by ART, because Gemma does not accept the enable_prefix_caching parameter.

To work around this, I've made the following changes in get_model_config.py:

use_gemma_config = config.get("use_gemma_config", False)

if use_gemma_config:
    # Gemma 3: same arguments as below, but without enable_prefix_caching
    init_args = InitArgs(
        model_name=base_model,
        max_seq_length=32768,
        load_in_4bit=True,  # False for LoRA 16bit
        fast_inference=True,  # Enable vLLM fast inference
        # vLLM args
        disable_log_stats=False,
        gpu_memory_utilization=(
            0.79 if enable_sleep_mode else 0.55
        ),  # Reduce if out of memory
        max_lora_rank=8,
        use_async=True,
    )
else:
    init_args = InitArgs(
        model_name=base_model,
        max_seq_length=32768,
        load_in_4bit=True,  # False for LoRA 16bit
        fast_inference=True,  # Enable vLLM fast inference
        # vLLM args
        disable_log_stats=False,
        enable_prefix_caching=True,
        gpu_memory_utilization=(
            0.79 if enable_sleep_mode else 0.55
        ),  # Reduce if out of memory
        max_lora_rank=8,
        use_async=True,
    )

I believe this would solve the problem: users can set the use_gemma_config parameter to prevent enable_prefix_caching from being added to the argument list.

Let me know if this is not correct or requires changes.
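The two branches differ only in enable_prefix_caching, so the same idea could also be expressed without duplicating the argument list. This is only a minimal sketch, not the code in this PR: build_init_kwargs is a hypothetical helper, and it returns a plain dict instead of constructing ART's InitArgs, so it runs on its own.

```python
from typing import Any


def build_init_kwargs(
    base_model: str,
    enable_sleep_mode: bool,
    use_gemma_config: bool = False,
) -> dict[str, Any]:
    """Hypothetical helper: collect the keyword arguments shown above.

    The values mirror the snippet in this PR description; only
    enable_prefix_caching depends on use_gemma_config.
    """
    kwargs: dict[str, Any] = {
        "model_name": base_model,
        "max_seq_length": 32768,
        "load_in_4bit": True,  # False for LoRA 16bit
        "fast_inference": True,  # Enable vLLM fast inference
        "disable_log_stats": False,
        # Reduce if out of memory
        "gpu_memory_utilization": 0.79 if enable_sleep_mode else 0.55,
        "max_lora_rank": 8,
        "use_async": True,
    }
    if not use_gemma_config:
        # Only added for models that accept prefix caching
        kwargs["enable_prefix_caching"] = True
    return kwargs
```

With something like this, the call site would reduce to a single construction from the returned keyword arguments, assuming InitArgs accepts them as in the snippet above.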

Thank you very much :)

Jul 15 '25 20:07 Lucas-Fernandes-Martins

Very cool! @bradhilton can you take a look at this one?

Jul 16 '25 19:07 corbt

@Lucas-Fernandes-Martins have you been able to test this? Does it work?

Jul 16 '25 21:07 bradhilton

Hi @corbt and @bradhilton, thank you for your message!

I spent today doing additional testing and, unfortunately, found a problem with the solution I proposed.

Although the change solves the enable_prefix_caching issue, another error appears (for some reason I failed to notice this yesterday):

AttributeError: 'Gemma3ForCausalLM' object has no attribute 'vllm_engine'

This seems closely linked to this open issue in Unsloth.

Also, when I try to disable vLLM altogether, I get:

     63             ctx = zmq.Context(async_ctx)
     64 
---> 65         Which previously had to be::
     66 
     67             ctx = zmq.Context.shadow(async_ctx.underlying)

zmq/backend/cython/context.pyx in zmq.backend.cython.context.Context.__init__()
TypeError: an integer is required

I apologize for opening the pull request so soon; I got carried away once the initial enable_prefix_caching issue was solved. If you feel it is appropriate, I'll close the pull request, investigate further, and try to solve the problem.

I've seen some folks in the community mention that Gemma 3 would be very useful to have in ART, especially because of its multilingual capabilities, so I'll do my best to solve this.

Either way, thank you for the help :)

Jul 17 '25 04:07 Lucas-Fernandes-Martins

Thank you @Lucas-Fernandes-Martins for your investigation. I am afraid that adding Gemma 3 support will likely be tricky.

Jul 17 '25 04:07 bradhilton

@Lucas-Fernandes-Martins it would be great to get Gemma 3 in! Definitely update this PR if you get to a working solution.

Jul 17 '25 12:07 corbt

Hi, thank you for your patience. After a few days of investigation, it seems that the main issue is that Unsloth's Gemma 3 doesn't support vLLM. However, I heard from the Unsloth community that vLLM support for Gemma 3 will be released soon (possibly as early as next week).

Once this happens, I'll test ART to see if it now works and keep you folks in the loop!

Thanks again :)

Jul 19 '25 02:07 Lucas-Fernandes-Martins

Thank you @Lucas-Fernandes-Martins for investigating!

Jul 19 '25 13:07 bradhilton