
EXL2 formatting is busted through the OpenAI-like API

Open MrMojoR opened this issue 11 months ago • 2 comments

Describe the bug

When I query the openai extension's endpoint, the output formatting is wrong; however, if I make the same prompt through the UI, the formatting is fine. I know that in the 2023-12-31 snapshot version this was working fine.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

  1. Enable the openai extension
  2. Load the Mixtral exllamav2 3.5bpw model.
  3. Use the 8-bit cache option.
  4. Send the prompt below to the API endpoint http://host:port/v1/chat/completions (a request sketch is shown after the prompt).

prompt:

Here is a python method:

```python
def completions_common(body: dict, is_legacy: bool = False, stream=False):
    object_type = 'text_completion.chunk' if stream else 'text_completion'
    created_time = int(time.time())
    cmpl_id = "conv-%d" % (int(time.time() * 1000000000))
    resp_list = 'data' if is_legacy else 'choices'

    prompt_str = 'context' if is_legacy else 'prompt'

    # ... encoded as a string, array of strings, array of tokens, or array of token arrays.
    if prompt_str not in body:
        raise InvalidRequestError("Missing required input", param=prompt_str)

    # common params
    generate_params = process_parameters(body, is_legacy=is_legacy)
    max_tokens = generate_params['max_new_tokens']
    generate_params['stream'] = stream
    requested_model = generate_params.pop('model')
    logprob_proc = generate_params.pop('logprob_proc', None)
    suffix = body['suffix'] if body['suffix'] else ''
    echo = body['echo']

    if not stream:
        prompt_arg = body[prompt_str]
        if isinstance(prompt_arg, str) or (isinstance(prompt_arg, list) and isinstance(prompt_arg[0], int)):
            prompt_arg = [prompt_arg]

        resp_list_data = []
        total_completion_token_count = 0
        total_prompt_token_count = 0

        for idx, prompt in enumerate(prompt_arg, start=0):
            if isinstance(prompt[0], int):
                # token lists
                if requested_model == shared.model_name:
                    prompt = decode(prompt)[0]
                else:
                    try:
                        encoder = tiktoken.encoding_for_model(requested_model)
                        prompt = encoder.decode(prompt)
                    except KeyError:
                        prompt = decode(prompt)[0]

            prefix = prompt if echo else ''
            token_count = len(encode(prompt)[0])
            total_prompt_token_count += token_count

            # generate reply #######################################
            debug_msg({'prompt': prompt, 'generate_params': generate_params})
            generator = generate_reply(prompt, generate_params, is_chat=False)
            answer = ''

            for a in generator:
                answer = a

            completion_token_count = len(encode(answer)[0])
            total_completion_token_count += completion_token_count
            stop_reason = "stop"
            if token_count + completion_token_count >= generate_params['truncation_length'] or completion_token_count >= max_tokens:
                stop_reason = "length"

            respi = {
                "index": idx,
                "finish_reason": stop_reason,
                "text": prefix + answer + suffix,
                "logprobs": {'top_logprobs': [logprob_proc.token_alternatives]} if logprob_proc else None,
            }

            resp_list_data.extend([respi])

        resp = {
            "id": cmpl_id,
            "object": object_type,
            "created": created_time,
            "model": shared.model_name,
            resp_list: resp_list_data,
            "usage": {
                "prompt_tokens": total_prompt_token_count,
                "completion_tokens": total_completion_token_count,
                "total_tokens": total_prompt_token_count + total_completion_token_count
            }
        }

        yield resp
    else:
        prompt = body[prompt_str]
        if isinstance(prompt, list):
            if prompt and isinstance(prompt[0], int):
                try:
                    encoder = tiktoken.encoding_for_model(requested_model)
                    prompt = encoder.decode(prompt)
                except KeyError:
                    prompt = decode(prompt)[0]
            else:
                raise InvalidRequestError(message="API Batched generation not yet supported.", param=prompt_str)

        prefix = prompt if echo else ''
        token_count = len(encode(prompt)[0])

        def text_streaming_chunk(content):
            # begin streaming
            chunk = {
                "id": cmpl_id,
                "object": object_type,
                "created": created_time,
                "model": shared.model_name,
                resp_list: [{
                    "index": 0,
                    "finish_reason": None,
                    "text": content,
                    "logprobs": {'top_logprobs': [logprob_proc.token_alternatives]} if logprob_proc else None,
                }],
            }

            return chunk

        yield text_streaming_chunk(prefix)

        # generate reply #######################################
        debug_msg({'prompt': prompt, 'generate_params': generate_params})
        generator = generate_reply(prompt, generate_params, is_chat=False)

        answer = ''
        seen_content = ''
        completion_token_count = 0

        for a in generator:
            answer = a

            len_seen = len(seen_content)
            new_content = answer[len_seen:]

            if not new_content or chr(0xfffd) in new_content:  # partial unicode character, don't send it yet.
                continue

            seen_content = answer
            chunk = text_streaming_chunk(new_content)
            yield chunk

        completion_token_count = len(encode(answer)[0])
        stop_reason = "stop"
        if token_count + completion_token_count >= generate_params['truncation_length'] or completion_token_count >= max_tokens:
            stop_reason = "length"

        chunk = text_streaming_chunk(suffix)
        chunk[resp_list][0]["finish_reason"] = stop_reason
        chunk["usage"] = {
            "prompt_tokens": token_count,
            "completion_tokens": completion_token_count,
            "total_tokens": token_count + completion_token_count
        }

        yield chunk
```

How can I optimize this? Provide fixed code snippets as well.

The output formatting will be off.
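
For reference, a request of roughly this shape reproduces the call in step 4. This is a minimal sketch; the host, port, model name, and parameter values are placeholders and may differ from the actual setup:

```python
import requests

# Placeholder host/port -- adjust to the local text-generation-webui instance.
URL = "http://127.0.0.1:5000/v1/chat/completions"

payload = {
    "model": "Mixtral-exl2-3.5bpw",  # assumed identifier for the loaded EXL2 model
    "messages": [
        {"role": "user", "content": "Here is a python method: ... How can I optimize this?"}
    ],
    "max_tokens": 1024,
    "stream": False,
}

response = requests.post(URL, json=payload, timeout=300)
response.raise_for_status()

# Print the raw content with repr() so missing newlines and broken formatting are easy to see.
print(repr(response.json()["choices"][0]["message"]["content"]))
```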

Screenshot

No response

For me, the output of this prompt was:

Here are some ways to optimize the given Python method (I'll focus on non-functional improvements and readability):1. Use type hints consistently for function arguments and return types2. Use f-strings for string formatting3. Split long lines into multiple lines4. Remove unnecessary parentheses5. Move repeated calculations into separate variables6. Move repeated method calls into separate methods7. Use defaultdict instead of initializing empty lists8. Utilize itertools functions for more efficient loops9. Extract magic numbers to constants10. Add docstrings to explain functionalityHere's an example of the refactored code with these improvements applied (note that I couldn't test this without the full context of your application):_________________ import itertools_, collections_, time from typing import Dict,\ dict,\ List,\ Optional,\ Tuple from typing_extensions import Final INPUT_PROMPT_STRS : Final[str] := 'context' IF_LEGACY : Final[bool] := False STREAM : Final[bool] := False MAX_TOKENS : Final[int] := ... REQUESTED_MODEL : Final[str] := ... LOGPROB_PROC : Optional[str]\ =\ None SUFFIX : str [...] ECHO : str [...] PROMPT_ARGUMENTS : List[Union[List[int], str]][][]: List[List[int] ENCODED_PROMPT **kwargs:\n assert isinstance(*kwargs,* list)* total_completion_tokencount **totalCompletionTokenCount*: int totalPromptTokenCount **totalPromptTokenCount*: int modelName **modelName*: str generatorGenerator **generator*: object usageUsage **usageUsage*: dict indexIdx**: int finishReasonFinishReason**: Optional[str]=None logprobsLogProbs**: Optional[Dict[str,* list*]]=\None respiRespi**: dict streamChunkStreamChunk**: Iterator[[Any]]=\ genexpr**vbnet12345678910111213def completionsCommon(body:dict, isLegacy:bool=False, stream:bool=False) -> Iterator[[Dict]]: objectTypeObjectType * finalStringFinalString * requestModelRequestedModel * paramsParams * suffixSuffix * echoEcho * promptArgPromptArg -> Union[:class:list[:class:int], str, list[:class:str], list[:class:list, optional]] -> Union[:class:tuple, tuple, tuple, optional]]]: objectTypeObjectType *= finalStringFinalString *= requestModelRequestedModel *= paramsParams *= suffixSuffix = echoEcho -= promptArgPromptArg -= processParametersBodyProcessParametersBody(body, isLegacy=isLegacy)-> Dict: generatedParamsGeneratedParams:\n maxTokensMaxTokens *, streamStream *, requestedModelRequestedModel *, logprobProcLogprobProc , kwargs:\n assert isinstance(kwargs, dict) createTimeCreateTime: int cmplIdCmplId: str respListRespList * finalStringFinalString += choicesChoices *"data"Data "promptStrPromptStr ": Any"] if promptStr not in body: raise InvalidRequestError("Missing required input", param=promptStr) generateParamsGenerateParams GenerateParametersBodyProcessParametersBody(body, isLegacy=isLegacy)-> Dict: maxTokensMaxTokens *, streamStream *, requestedModelRequestedModel , logprobProcLogprobProc , *kwargs:\n assert isinstance(kwargs, dict) suffixSuffix += body['suffix'] echoEcho += body['echo'] return {'maxNewTokens': maxTokens, 'stream': stream, 'model': requestedModel} def encodeEncode (textText) -> Tuple [int]: # Implement encoding logic here def decodeDecode (encodedEncoded) -> Tuple [Any]: # Implement decoding logic here def generateReplyGenerateReply (promptPrompt, generateParamsGenerateParams, isChatIsChat: bool = False) -> Iterator [Any]: # Implement reply generation logic here def debugMsgDebugMsg (msgMsg) -> None: # Implement debugging logic here respResponse: Dict = { "id": cmplIdCmplId, "object": objectType, "created": 
createTimeCreateTime, "model": modelName} resp["usage"] = usageUsage resp["usage"]["promption_tokens"] = totalPromptTokenCount resp["usage"]["completion_tokens"] = totalCompletionTokenCount resp["usage"]["total_tokens"] = totalCompletionTokenCount + totalPromptTokenCount yield resp

Optimizations applied include:

1. Using type hints and default values for function arguments and return types
2. Using f-strings for string formatting
3. Splitting long lines into multiple lines
4. Removing unnecessary parentheses where possible
5. Moving repeated calculations into separate variables (e.g., creating a variable for the response dictionary)
6. Moving repeated method calls into separate methods (e.g., creating a separate method for processing parameters)7. Using `defaultdict` instead of initializing empty lists8. Utilizing `itertools` functions for more efficient loops9. Extracting magic numbers to constants (e.g., defining constants like INPUT_PROMPT_STRS and IF_LEGACY at the top of the file)10


System Info

```shell
Docker, Nvidia RTX 4090
```

MrMojoR · Mar 14 '24

Hoping for attention on this question. The UI's config is not applied to the OpenAI API.
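
A possible workaround sketch (untested; the endpoint, port, and model name are assumptions): pass the generation parameters explicitly in each API request instead of relying on the values set in the UI:

```python
import requests

payload = {
    "model": "Mixtral-exl2-3.5bpw",  # assumed identifier for the loaded model
    "messages": [{"role": "user", "content": "Hello"}],
    # Explicit sampling parameters, so the request does not depend on the UI config.
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512,
}

response = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```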

LeoYelton · Apr 09 '24

Has anyone found a workaround for this? I believe I'm having the same issue. GGUF works fine through the API, but EXL2 is pretty much useless.

abbail · Apr 30 '24