max_tokens default value is not compatible with the OpenAI chat completions API contract
When using the chat completions API, if `max_tokens` is not set, the model output is truncated at roughly 512 tokens.
Problem 1: Per the OpenAI contract, the output length should not be limited when `max_tokens` is unset, as long as it does not exceed the model's context length.
Problem 2: When the output is truncated, the API returns `finish_reason: "stop"`, but the OpenAI contract specifies `"length"` for responses cut off by a token limit.
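Problem 2 can be verified directly with a non-streaming call (a minimal sketch; the model name and port match the test script below, and the prompt is arbitrary):

```python
from openai import OpenAI

client = OpenAI(api_key="unused", base_url="http://localhost:5272/v1")

# max_tokens is deliberately omitted; per the OpenAI contract the response
# should then only be limited by the model's context length.
response = client.chat.completions.create(
    model="qwen2.5-coder-1.5b-instruct-generic-cpu:3",
    messages=[{"role": "user", "content": "Write a long story."}],
)

choice = response.choices[0]
print(response.usage.completion_tokens, choice.finish_reason)
# Observed here: ~512 completion tokens with finish_reason == "stop";
# a response truncated by a token limit should report finish_reason == "length".
```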
Example script for testing:
```python
#!/usr/bin/env python3
"""
Simple chat completion script for calling the localhost:5272/v1 API in streaming mode.
"""
import argparse
from typing import List

from openai import OpenAI
from openai.types.chat import ChatCompletionMessageParam


def chat_completion_stream(message: str, api_key: str = "unused", base_url: str = "http://localhost:5272/v1"):
    """
    Send a chat completion request with streaming support.

    Args:
        message: The user's message
        api_key: OpenAI API key (default: "unused")
        base_url: API base URL (default: localhost:5272/v1)
    """
    # Initialize the OpenAI client against the local endpoint
    client = OpenAI(api_key=api_key, base_url=base_url)

    # Prepare messages
    messages: List[ChatCompletionMessageParam] = [
        {"role": "user", "content": message}
    ]

    try:
        print(f"Sending message: {message}")
        print(f"API URL: {base_url}")
        print("Assistant: ", end="", flush=True)

        # Create streaming chat completion; max_tokens is deliberately not set
        response = client.chat.completions.create(
            model="qwen2.5-coder-1.5b-instruct-generic-cpu:3",  # Change this to match your model name
            messages=messages,
            stream=True
        )

        # Stream the response
        for chunk in response:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
        print()  # New line after streaming
    except Exception as e:
        print(f"\nError: {e}")


def main():
    """Parse command line arguments and start the chat completion."""
    parser = argparse.ArgumentParser(
        description="Simple chat completion with streaming support"
    )
    parser.add_argument(
        "--message",
        type=str,
        default="Write a HTML5 app that helps me to check if a word entered inside a textbox is a palindrome or not",
        help="Message to send"
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default="unused",
        help="OpenAI API key (default: 'unused')"
    )
    parser.add_argument(
        "--base-url",
        type=str,
        default="http://localhost:5272/v1",
        help="API base URL (default: 'http://localhost:5272/v1')"
    )
    args = parser.parse_args()

    print("Simple Chat Completion Demo")
    print("=" * 40)

    # Send the chat completion request
    chat_completion_stream(
        message=args.message,
        api_key=args.api_key,
        base_url=args.base_url
    )


if __name__ == "__main__":
    main()
```
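A workaround for interactive use is to pass `max_tokens` explicitly (a sketch, assuming the same endpoint and model name as above); this does not help clients that cannot set the parameter themselves:

```python
from openai import OpenAI

client = OpenAI(api_key="unused", base_url="http://localhost:5272/v1")

# Workaround sketch: pass max_tokens explicitly so the ~512-token default
# does not apply. 4096 is an arbitrary example value and must stay within
# what the device and model can actually serve.
response = client.chat.completions.create(
    model="qwen2.5-coder-1.5b-instruct-generic-cpu:3",
    messages=[{"role": "user", "content": "Write a long story."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```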
Example output:
<img width="1016" height="642" alt="Image" src="https://github.com/user-attachments/assets/298460a2-e153-414e-8f0d-7482fce4acf2" />
References:
- https://platform.openai.com/docs/api-reference/chat/create
- https://github.com/openai/openai-python/issues/436#issuecomment-1763224848
Thank you for reporting this @a1exwang. The max tokens for a local model will often have to be less than the context length of the model, due to memory constraints on the device.
We will investigate the stop reason issue.
+1. Some apps (e.g. OpenAI Codex) can't set the max_tokens parameter when calling the API, which makes it impossible to use Foundry Local with them. It would also be nice to be able to set default values for these parameters in the Foundry Local configuration.