max_tokens default value is not compatible with the OpenAI chat completions API contract
When using the chat completions API, if `max_tokens` is not set, the model output is truncated at roughly 512 tokens.
Problem 1: Per the OpenAI contract, the output length should not be limited when `max_tokens` is unset, as long as it does not exceed the model's context length.
Problem 2: When the output is truncated, the API returns `finish_reason: "stop"`, but the OpenAI contract specifies `"length"` for responses cut off by a token limit.
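Problem 2 can be verified directly with a non-streaming call (a minimal sketch; the model name and port match the test script below, and the prompt is arbitrary):

```python
from openai import OpenAI

client = OpenAI(api_key="unused", base_url="http://localhost:5272/v1")

# max_tokens is deliberately omitted; per the OpenAI contract the response
# should then only be limited by the model's context length.
response = client.chat.completions.create(
    model="qwen2.5-coder-1.5b-instruct-generic-cpu:3",
    messages=[{"role": "user", "content": "Write a long story."}],
)

choice = response.choices[0]
print(response.usage.completion_tokens, choice.finish_reason)
# Observed here: ~512 completion tokens with finish_reason == "stop";
# a response truncated by a token limit should report finish_reason == "length".
```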
Example script for testing:
```python
#!/usr/bin/env python3
"""
Simple chat completion script for calling the localhost:5272/v1 API in streaming mode.
"""
import argparse
from typing import List

from openai import OpenAI
from openai.types.chat import ChatCompletionMessageParam


def chat_completion_stream(message: str, api_key: str = "unused", base_url: str = "http://localhost:5272/v1"):
    """
    Send a chat completion request with streaming support.

    Args:
        message: The user's message
        api_key: OpenAI API key (default: "unused")
        base_url: API base URL (default: localhost:5272/v1)
    """
    # Initialize the OpenAI client against the local endpoint
    client = OpenAI(api_key=api_key, base_url=base_url)

    # Prepare messages
    messages: List[ChatCompletionMessageParam] = [
        {"role": "user", "content": message}
    ]

    try:
        print(f"Sending message: {message}")
        print(f"API URL: {base_url}")
        print("Assistant: ", end="", flush=True)

        # Create streaming chat completion; max_tokens is deliberately not set
        response = client.chat.completions.create(
            model="qwen2.5-coder-1.5b-instruct-generic-cpu:3",  # Change this to match your model name
            messages=messages,
            stream=True
        )

        # Stream the response
        for chunk in response:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
        print()  # New line after streaming
    except Exception as e:
        print(f"\nError: {e}")


def main():
    """Parse command line arguments and start the chat completion."""
    parser = argparse.ArgumentParser(
        description="Simple chat completion with streaming support"
    )
    parser.add_argument(
        "--message",
        type=str,
        default="Write a HTML5 app that helps me to check if a word entered inside a textbox is a palindrome or not",
        help="Message to send"
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default="unused",
        help="OpenAI API key (default: 'unused')"
    )
    parser.add_argument(
        "--base-url",
        type=str,
        default="http://localhost:5272/v1",
        help="API base URL (default: 'http://localhost:5272/v1')"
    )
    args = parser.parse_args()

    print("Simple Chat Completion Demo")
    print("=" * 40)

    # Send the chat completion request
    chat_completion_stream(
        message=args.message,
        api_key=args.api_key,
        base_url=args.base_url
    )


if __name__ == "__main__":
    main()
```
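A workaround for interactive use is to pass `max_tokens` explicitly (a sketch, assuming the same endpoint and model name as above); this does not help clients that cannot set the parameter themselves:

```python
from openai import OpenAI

client = OpenAI(api_key="unused", base_url="http://localhost:5272/v1")

# Workaround sketch: pass max_tokens explicitly so the ~512-token default
# does not apply. 4096 is an arbitrary example value and must stay within
# what the device and model can actually serve.
response = client.chat.completions.create(
    model="qwen2.5-coder-1.5b-instruct-generic-cpu:3",
    messages=[{"role": "user", "content": "Write a long story."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```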
Example output:
<img width="1016" height="642" alt="Image" src="https://github.com/user-attachments/assets/298460a2-e153-414e-8f0d-7482fce4acf2" />
References:
- https://platform.openai.com/docs/api-reference/chat/create
- https://github.com/openai/openai-python/issues/436#issuecomment-1763224848
Thank you for reporting this @a1exwang. The max tokens for a local model will often have to be less than the context length of the model, due to memory constraints on the device.
We will investigate the stop reason issue.
+1. Some apps (e.g. OpenAI Codex) can't set the max_tokens parameter when calling the API, which makes it impossible to use Foundry Local with them. It would also be nice to be able to set default values for these parameters in the Foundry Local configuration.