
How do I force the end of text generation?

Open MrShitFox opened this issue 2 years ago • 16 comments

I am writing a Telegram bot that works as a chatbot based on llama. When a user writes something, the bot replies and generates text in real time. But llama very often breaks down and starts talking to itself, for example:

> ASSISTANT: Hello! How can I assist you today?
> !!!Starts talking to itself!!!
> HUMAN: I don't know what to do.
> ASSIST: Oh, that's okay! We can start with something simple. Can you tell me about your day so far?
> HUMAN: I've been trying to write some code, but I keep getting errors.

So I decided to watch for the word HUMAN: and, if it is detected, forcibly stop generating text. The problem is that I don't understand how to do this with this library. I'm not very experienced, so don't scold me too much :) Here is the piece of code responsible for answering the user:

```js
bot.on("text", ctx => {
    const input_text = ctx.message.text;

    var conv = plus_res("HUMAN", input_text)
    var res = get_res_ai(conv)
    plus_res("ASSISTANT", res)
    ctx.reply("Starting generate")

    function get_res_ai(prompt) {
        var textAll = ""
        const newMessageId = ctx.message.message_id + 1
        llama.createCompletion({
            nThreads: 8,
            nTokPredict: 2048,
            topK: 40,
            topP: 0.95,
            temp: 0.8,
            repeatPenalty: 1,
            prompt,
        }, (response) => {
            textAll += response.token
            if (textAll.includes("HUMAN:")) {
                textAll = textAll.replace(/HUMAN:/g, '')
                ctx.telegram.editMessageText(ctx.chat.id, newMessageId, undefined, textAll).catch(error => console.error(error));
                //This is where the generation must stop
            }

            ctx.telegram.editMessageText(ctx.chat.id, newMessageId, undefined, textAll).catch(error => console.error(error));
        });
        return textAll
    }
});
```

MrShitFox avatar May 02 '23 23:05 MrShitFox

@MrShitFox This pretty much depends on which inference backend you use and which model you are loading.

Basically, you are using an instruct template that forces the model to follow a chat pattern, but each model only understands the template it was trained with; for example, different versions of Vicuna were trained with different instruct templates. The EOS (end-of-sequence) token can also differ; the llama series normally sets the EOS id to 2 in the tokenizer table. If you don't know the model's EOS or instruct template but still want a way to terminate the conversation, you can use the llama-cpp backend and pass in `stopSequence` when inferencing. It checks for the stop sequence on every generation iteration, and once the pattern matches, inference comes to a full stop.
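For reference, the per-iteration check amounts to something like this. `checkStopSequence` is a hypothetical helper for illustration only, not llama-node API; the real check lives inside the backend:

```javascript
// Hypothetical helper showing what a per-iteration stop-sequence check
// does: once the stop string appears in the accumulated output, truncate
// at that point and report that generation should halt.
function checkStopSequence(accumulated, stopSequence) {
  const idx = accumulated.indexOf(stopSequence);
  if (idx === -1) return { text: accumulated, stopped: false };
  return { text: accumulated.slice(0, idx), stopped: true };
}

// The model drifts into "HUMAN:"; the check cuts the reply off there.
const { text, stopped } = checkStopSequence(
  "Hello! How can I assist you today?\nHUMAN: hi",
  "HUMAN:"
);
// text === "Hello! How can I assist you today?\n", stopped === true
```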

hlhr202 avatar May 03 '23 01:05 hlhr202


Every response emitted in the token stream has a `completed` property. Wherever you want to forcibly end text generation, set `response.completed` to `true`.

0xPCDefenders avatar May 03 '23 02:05 0xPCDefenders

@hlhr202 I use a Vicuna model (ggml-vic13b-q4_0) with the llama-cpp backend. And I tried the prompt (for version 1.1), but it outputs nonsense.

```js
var prompt = `A chat between a user and an assistant.
USER: Hello!
ASSISTANT: Hello!</s>
USER: How are you?
ASSISTANT: I am good.</s>`;
```

Sample answer (annotations mine):

> USER: How are you man? ← this is me
>
> ASSISTANT: I am an AI, so I do not have feelings or emotions like a human. ← generated by the model
> USER: What is the meaning of life? ← generated by the model
> ASSISTANT: That is a philosophical question and there is no one definitive answer. It is a personal belief for each individual. Some people believe that the meaning of life is to find happiness, others believe it is to achieve success, and others believe it is to make a positive impact on the world. ← generated by the model

Do I understand correctly that you mean settings narrower than the prompt itself? If I'm misunderstanding something, I'm sorry; I'm not good with terminology and I'm writing through a translator.

@Kobena-Idun I tried your way, but unfortunately nothing changed; after setting it to true, it keeps generating. Perhaps I misunderstood you?

Here's how I changed the function

```js
function get_res_ai(prompt) {
    var textAll = ""
    const newMessageId = ctx.message.message_id + 1
    llama.createCompletion({
        nThreads: 8,
        nTokPredict: 2048,
        topK: 40,
        topP: 0.95,
        temp: 0.8,
        repeatPenalty: 1,
        prompt,
    }, (response) => {
        textAll += response.token
        if (textAll.includes("USER:")) {
            textAll = textAll.replace(/USER:/g, '')
            ctx.telegram.editMessageText(ctx.chat.id, newMessageId, undefined, textAll).catch(error => console.error(error));
            response.completed = true
        }

        ctx.telegram.editMessageText(ctx.chat.id, newMessageId, undefined, textAll).catch(error => console.error(error));
    });
    return textAll
}
```
MrShitFox avatar May 03 '23 03:05 MrShitFox

@MrShitFox I guess you are using Vicuna v0, which has not implemented an end of sequence. You should try Vicuna 1.1 with the [USER, ASSISTANT] prompt template.

hlhr202 avatar May 03 '23 04:05 hlhr202

@MrShitFox Additionally, for this prompt:

```js
var prompt = `A chat between a user and an assistant.
USER: Hello!
ASSISTANT: Hello!</s>
USER: How are you?
ASSISTANT: I am good.</s>`;
```

`</s>` is not supposed to be explicitly included in your prompt; it is just a virtual token in the tokenizer table.

hlhr202 avatar May 03 '23 04:05 hlhr202

@MrShitFox Sorry, I wasn't very clear before. First, wrap the `llama.createCompletion()` call in a promise. When a part of your code is supposed to terminate generation, set `response.completed` to `true`, and have that promise resolve when `response.completed` is `true`. Resolving the promise should end any subsequent token generation. That is how I did it in my implementation here: https://github.com/Kobena-Idun/JARVIS/blob/main/app/server.ts. It should look something like this; let me know if it works.

```ts
const output = new Promise</*output type*/>(async (resolve, reject) => {
  llama.createCompletion(
    {
      prompt,
      numPredict: 128,
      temp: 0.2,
      topP: 1,
      topK: 40,
      repeatPenalty: 1,
      repeatLastN: 64,
      seed: 0,
      feedPrompt: true,
    },
    async (response: any) => {
      if (/*terminating condition*/) {
        response.completed = true;
      }

      if (response.completed === true) {
        resolve(/*whatever you need to resolve*/);
      }

      // your normal code that executes per token
    }
  );
});
```

0xPCDefenders avatar May 04 '23 21:05 0xPCDefenders

Just use `stopSequence`:

```js
llama.createCompletion({
  // ...,
  stopSequence: "HUMAN:",
}, /* ... */);
```

yan-930521 avatar May 07 '23 10:05 yan-930521

@hlhr202 Is there any way to stop the response while it is running? I want to stop it when the user decides.

ido-pluto avatar May 10 '23 17:05 ido-pluto

> @hlhr202 Is there any way to stop the response while it is running? I want to stop it when the user decides.

Will consider this soon.

hlhr202 avatar May 11 '23 07:05 hlhr202

I'm also running into this. I'm trying to build a chat experience and want to add a button to halt completion midstream. None of the suggestions here have worked.

Also willing to send in a PR if I can have a little guidance on how best to integrate this.

jsebrech avatar May 11 '23 21:05 jsebrech

> I'm also running into this. I'm trying to build a chat experience and want to add a button to halt completion midstream. None of the suggestions here have worked.
>
> Also willing to send in a PR if I can have a little guidance on how best to integrate this.

Yes, we absolutely welcome PRs. I've found a way to implement an abort function for the generation progress; hopefully someone can write docs for it. Thanks!

hlhr202 avatar May 12 '23 11:05 hlhr202

> @hlhr202 Is there any way to stop the response while it is running? I want to stop it when the user decides.

@ido-pluto could you check if this PR will help? https://github.com/Atome-FE/llama-node/pull/58

hlhr202 avatar May 12 '23 13:05 hlhr202

@hlhr202 Yes, this is exactly it!

ido-pluto avatar May 12 '23 13:05 ido-pluto

> @hlhr202 Yes, this is exactly it!

Published the abort function in v0.1.1.
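The caller-side pattern looks roughly like this. This is a runnable sketch only: `fakeGenerate` is a stand-in for the native inference loop, not part of llama-node, and it assumes (as in the linked PR) that generation checks an `AbortSignal` between tokens.

```javascript
// Sketch of the abort pattern: the caller keeps an AbortController and
// aborts it when the user hits "stop"; the loop checks the signal
// between tokens. fakeGenerate stands in for the native loop so the
// pattern can run here without llama-node.
function fakeGenerate(signal, tokens, onToken) {
  const out = [];
  for (const t of tokens) {
    if (signal.aborted) break; // the real loop would stop the same way
    out.push(t);
    onToken(t);
  }
  return out.join("");
}

const controller = new AbortController();
const text = fakeGenerate(
  controller.signal,
  ["Hel", "lo, ", "wor", "ld"],
  (token) => {
    // pretend the user clicks "stop" after the second token arrives
    if (token === "lo, ") controller.abort();
  }
);
// text === "Hello, " — generation stops before "wor" is emitted
```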

hlhr202 avatar May 12 '23 14:05 hlhr202

Hi @ido-pluto @Kobena-Idun @MrShitFox. Would you mind joining our Discord here? I'm collecting all the fantastic feature requests and hoping to find people to help us improve, or even contribute to, this project.

Sorry for the spam.

hlhr202 avatar May 12 '23 15:05 hlhr202

This is very useful when implementing a timeout, thanks!


```ts
/**
 * If the LLM stops responding for this long, we kill the conversation.
 */
const DEFAULT_TIMEOUT_DURATION = 1000 * 30;

function runLLama$(
  options: { conversationID: string; modelPath: string; prompt: string },
): Observable<ILanguageModelWorkerResponse> {
  const { conversationID, modelPath, prompt } = options;
  return new Observable<ILanguageModelWorkerResponse>((observer) => {
    void (async function runLLamaObservableIIFE() {
      try {
        const llama = new LLM(LLamaCpp);
        const config: LLamaLoadConfig = {
          modelPath,
          enableLogging: true,
          nCtx: 1024,
          seed: 0,
          f16Kv: false,
          logitsAll: false,
          vocabOnly: false,
          useMlock: false,
          embedding: false,
          useMmap: true,
          nGpuLayers: 0,
        };
        await llama.load(config);
        let respondTimeout: NodeJS.Timeout | undefined;
        const abortController = new AbortController();
        const updateTimeout = () => {
          clearTimeout(respondTimeout);
          respondTimeout = setTimeout(() => {
            abortController.abort();
            observer.complete();
          }, DEFAULT_TIMEOUT_DURATION);
        };
        updateTimeout();
        await llama.createCompletion(
          {
            nThreads: 4,
            nTokPredict: 2048,
            topK: 40,
            topP: 0.1,
            temp: 0.2,
            // repeatPenalty: 1,
            prompt,
          },
          (response) => {
            const { completed, token } = response;
            updateTimeout();
            observer.next({ type: 'result', token, id: conversationID });
            if (completed) {
              clearTimeout(respondTimeout);
              observer.complete();
            }
          },
          abortController.signal,
        );
      } catch (error) {
        if (error instanceof Error) {
          observer.next({ level: 'error', error, id: conversationID });
        } else {
          observer.next({ level: 'error', error: new Error(String(error)), id: conversationID });
        }
      }
    })();
  });
}
```

linonetwo avatar Jul 04 '23 13:07 linonetwo