text-generation-webui
EXL2 formatting is busted through the OpenAI-like API
### Describe the bug
When I query the openai extension's endpoint, the output formatting is wrong; however, if I send the same prompt through the UI, the formatting is fine. I know this was working fine in the 2023.12.31 snapshot version.
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Reproduction
- Enable the openai extension
- Load the Mixtral exllamav2 3.5bpw model.
- Use the 8-bit cache option.
- Send this prompt to the API endpoint http://host:port/v1/chat/completions (see the example request after the prompt):
prompt:

```
Here is a python method:

def completions_common(body: dict, is_legacy: bool = False, stream=False):
    object_type = 'text_completion.chunk' if stream else 'text_completion'
    created_time = int(time.time())
    cmpl_id = "conv-%d" % (int(time.time() * 1000000000))
    resp_list = 'data' if is_legacy else 'choices'
    prompt_str = 'context' if is_legacy else 'prompt'

    # ... encoded as a string, array of strings, array of tokens, or array of token arrays.
    if prompt_str not in body:
        raise InvalidRequestError("Missing required input", param=prompt_str)

    # common params
    generate_params = process_parameters(body, is_legacy=is_legacy)
    max_tokens = generate_params['max_new_tokens']
    generate_params['stream'] = stream
    requested_model = generate_params.pop('model')
    logprob_proc = generate_params.pop('logprob_proc', None)
    suffix = body['suffix'] if body['suffix'] else ''
    echo = body['echo']

    if not stream:
        prompt_arg = body[prompt_str]
        if isinstance(prompt_arg, str) or (isinstance(prompt_arg, list) and isinstance(prompt_arg[0], int)):
            prompt_arg = [prompt_arg]

        resp_list_data = []
        total_completion_token_count = 0
        total_prompt_token_count = 0

        for idx, prompt in enumerate(prompt_arg, start=0):
            if isinstance(prompt[0], int):
                # token lists
                if requested_model == shared.model_name:
                    prompt = decode(prompt)[0]
                else:
                    try:
                        encoder = tiktoken.encoding_for_model(requested_model)
                        prompt = encoder.decode(prompt)
                    except KeyError:
                        prompt = decode(prompt)[0]

            prefix = prompt if echo else ''
            token_count = len(encode(prompt)[0])
            total_prompt_token_count += token_count

            # generate reply #######################################
            debug_msg({'prompt': prompt, 'generate_params': generate_params})
            generator = generate_reply(prompt, generate_params, is_chat=False)
            answer = ''

            for a in generator:
                answer = a

            completion_token_count = len(encode(answer)[0])
            total_completion_token_count += completion_token_count
            stop_reason = "stop"
            if token_count + completion_token_count >= generate_params['truncation_length'] or completion_token_count >= max_tokens:
                stop_reason = "length"

            respi = {
                "index": idx,
                "finish_reason": stop_reason,
                "text": prefix + answer + suffix,
                "logprobs": {'top_logprobs': [logprob_proc.token_alternatives]} if logprob_proc else None,
            }

            resp_list_data.extend([respi])

        resp = {
            "id": cmpl_id,
            "object": object_type,
            "created": created_time,
            "model": shared.model_name,
            resp_list: resp_list_data,
            "usage": {
                "prompt_tokens": total_prompt_token_count,
                "completion_tokens": total_completion_token_count,
                "total_tokens": total_prompt_token_count + total_completion_token_count
            }
        }

        yield resp
    else:
        prompt = body[prompt_str]
        if isinstance(prompt, list):
            if prompt and isinstance(prompt[0], int):
                try:
                    encoder = tiktoken.encoding_for_model(requested_model)
                    prompt = encoder.decode(prompt)
                except KeyError:
                    prompt = decode(prompt)[0]
            else:
                raise InvalidRequestError(message="API Batched generation not yet supported.", param=prompt_str)

        prefix = prompt if echo else ''
        token_count = len(encode(prompt)[0])

        def text_streaming_chunk(content):
            # begin streaming
            chunk = {
                "id": cmpl_id,
                "object": object_type,
                "created": created_time,
                "model": shared.model_name,
                resp_list: [{
                    "index": 0,
                    "finish_reason": None,
                    "text": content,
                    "logprobs": {'top_logprobs': [logprob_proc.token_alternatives]} if logprob_proc else None,
                }],
            }

            return chunk

        yield text_streaming_chunk(prefix)

        # generate reply #######################################
        debug_msg({'prompt': prompt, 'generate_params': generate_params})
        generator = generate_reply(prompt, generate_params, is_chat=False)

        answer = ''
        seen_content = ''
        completion_token_count = 0

        for a in generator:
            answer = a
            len_seen = len(seen_content)
            new_content = answer[len_seen:]

            if not new_content or chr(0xfffd) in new_content:  # partial unicode character, don't send it yet.
                continue

            seen_content = answer
            chunk = text_streaming_chunk(new_content)
            yield chunk

        completion_token_count = len(encode(answer)[0])
        stop_reason = "stop"
        if token_count + completion_token_count >= generate_params['truncation_length'] or completion_token_count >= max_tokens:
            stop_reason = "length"

        chunk = text_streaming_chunk(suffix)
        chunk[resp_list][0]["finish_reason"] = stop_reason
        chunk["usage"] = {
            "prompt_tokens": token_count,
            "completion_tokens": completion_token_count,
            "total_tokens": token_count + completion_token_count
        }

        yield chunk

How can I optimize this? Provide fixed code snippets as well.
```
The output formatting will be off.
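For reference, here is a minimal sketch of the kind of request involved, assuming the default OpenAI-compatible /v1/chat/completions endpoint exposed by the openai extension; the payload values and sampling parameters below are assumptions for illustration, not the exact request from this report:

```python
# Hypothetical reproduction script (not the exact request from the report).
# Replace host:port with the address of your text-generation-webui instance.
import requests

url = "http://host:port/v1/chat/completions"

payload = {
    "messages": [
        {
            "role": "user",
            "content": "Here is a python method: ...\nHow can I optimize this? Provide fixed code snippets as well.",
        }
    ],
    "max_tokens": 1024,  # assumed value
    "stream": False,
}

response = requests.post(url, json=payload, timeout=600)
print(response.json()["choices"][0]["message"]["content"])
```

Sending the same prompt through the web UI produces correctly formatted output, so the problem appears specific to the API path.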
### Screenshot
No response
The output of this prompt was the following for me (pasted verbatim):

```
Here are some ways to optimize the given Python method (I'll focus on non-functional improvements and readability):1. Use type hints consistently for function arguments and return types2. Use f-strings for string formatting3. Split long lines into multiple lines4. Remove unnecessary parentheses5. Move repeated calculations into separate variables6. Move repeated method calls into separate methods7. Use defaultdict
instead of initializing empty lists8. Utilize itertools
functions for more efficient loops9. Extract magic numbers to constants10. Add docstrings to explain functionalityHere's an example of the refactored code with these improvements applied (note that I couldn't test this without the full context of your application):_________________ import itertools_, collections_, time from typing import Dict,\ dict,\ List,\ Optional,\ Tuple from typing_extensions import Final INPUT_PROMPT_STRS : Final[str] := 'context' IF_LEGACY : Final[bool] := False STREAM : Final[bool] := False MAX_TOKENS : Final[int] := ... REQUESTED_MODEL : Final[str] := ... LOGPROB_PROC : Optional[str]\ =\ None SUFFIX : str [...] ECHO : str [...] PROMPT_ARGUMENTS : List[Union[List[int], str]][][]: List[List[int] ENCODED_PROMPT **kwargs:\n assert isinstance(*kwargs,* list)* total_completion_tokencount **totalCompletionTokenCount*: int totalPromptTokenCount **totalPromptTokenCount*: int modelName **modelName*: str generatorGenerator **generator*: object usageUsage **usageUsage*: dict indexIdx**: int finishReasonFinishReason**: Optional[str]=None logprobsLogProbs**: Optional[Dict[str,* list*]]=\None respiRespi**: dict streamChunkStreamChunk**: Iterator[[Any]]=\ genexpr**vbnet12345678910111213def completionsCommon(body:dict, isLegacy:bool=False, stream:bool=False) -> Iterator[[Dict]]: objectTypeObjectType * finalStringFinalString * requestModelRequestedModel * paramsParams * suffixSuffix * echoEcho * promptArgPromptArg -> Union[:class:list
[:class:int
], str
, list
[:class:str
], list
[:class:list
, optional]] -> Union[:class:tuple
, tuple
, tuple
, optional]]]: objectTypeObjectType *= finalStringFinalString *= requestModelRequestedModel *= paramsParams *= suffixSuffix = echoEcho -= promptArgPromptArg -= processParametersBodyProcessParametersBody(body, isLegacy=isLegacy)-> Dict: generatedParamsGeneratedParams:\n maxTokensMaxTokens *, streamStream *, requestedModelRequestedModel *, logprobProcLogprobProc , kwargs:\n assert isinstance(kwargs, dict) createTimeCreateTime: int cmplIdCmplId: str respListRespList * finalStringFinalString += choicesChoices *"data"Data "promptStrPromptStr ": Any"] if promptStr not in body: raise InvalidRequestError("Missing required input", param=promptStr) generateParamsGenerateParams GenerateParametersBodyProcessParametersBody(body, isLegacy=isLegacy)-> Dict: maxTokensMaxTokens *, streamStream *, requestedModelRequestedModel , logprobProcLogprobProc , *kwargs:\n assert isinstance(kwargs, dict) suffixSuffix += body['suffix'] echoEcho += body['echo'] return {'maxNewTokens': maxTokens, 'stream': stream, 'model': requestedModel} def encodeEncode (textText) -> Tuple [int]: # Implement encoding logic here def decodeDecode (encodedEncoded) -> Tuple [Any]: # Implement decoding logic here def generateReplyGenerateReply (promptPrompt, generateParamsGenerateParams, isChatIsChat: bool = False) -> Iterator [Any]: # Implement reply generation logic here def debugMsgDebugMsg (msgMsg) -> None: # Implement debugging logic here respResponse: Dict = { "id": cmplIdCmplId, "object": objectType, "created": createTimeCreateTime, "model": modelName} resp["usage"] = usageUsage resp["usage"]["promption_tokens"] = totalPromptTokenCount resp["usage"]["completion_tokens"] = totalCompletionTokenCount resp["usage"]["total_tokens"] = totalCompletionTokenCount + totalPromptTokenCount yield resp
Optimizations applied include:
1. Using type hints and default values for function arguments and return types
2. Using f-strings for string formatting
3. Splitting long lines into multiple lines
4. Removing unnecessary parentheses where possible
5. Moving repeated calculations into separate variables (e.g., creating a variable for the response dictionary)
6. Moving repeated method calls into separate methods (e.g., creating a separate method for processing parameters)7. Using `defaultdict` instead of initializing empty lists8. Utilizing `itertools` functions for more efficient loops9. Extracting magic numbers to constants (e.g., defining constants like INPUT_PROMPT_STRS and IF_LEGACY at the top of the file)10
```
### System Info
```shell
Docker, Nvidia RTX 4090
```

Hoping this issue gets some attention. The UI's configuration does not take effect through the openai API.
Has anyone found a workaround for this? I believe I'm having the same issue: GGUF works fine through the API, but EXL2 is pretty much useless.