
Change tokenizer to use gpt-3-encoder for splitting

Open mbusigin opened this issue 2 years ago • 3 comments

Find this here:

https://www.npmjs.com/package/gpt-3-encoder

More info:

https://beta.openai.com/tokenizer

mbusigin, Jan 24 '23 13:01

#3 adds gpt-3-encoder to sql2llm. References to the other tokenizer, tokenize_native (via natural.TreebankWordTokenizer()), still exist in workflows/prompt.ts and workflows/embeddings.ts.

marcgreen, Jan 29 '23 21:01

The real question is whether we should just use this everywhere. I'm thinking yes. Can you think of any reason not to?

mbusigin, Jan 30 '23 03:01

I agree, I think it makes sense to use this everywhere. The two files I listed are the only other places I see any tokenizer being used in the project. (workflows/llm.ts also imports the tokenizers but doesn't seem to use them.)

marcgreen, Jan 31 '23 23:01
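
For illustration, here is a minimal sketch of what "use this everywhere" could look like: splitting text by token count behind a small codec interface, so the tokenizer is swapped in one place. The names `TokenCodec`, `splitByTokens`, and `charCodec` are hypothetical and do not come from the project. The interface is shaped to match the `encode` and `decode` functions that gpt-3-encoder exports; the one-token-per-character codec is only a stand-in so the snippet runs without dependencies.

```typescript
// Hypothetical codec shape; gpt-3-encoder's `encode` and `decode` fit it.
interface TokenCodec {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Split `text` into consecutive chunks of at most `maxTokens` tokens each.
function splitByTokens(text: string, maxTokens: number, codec: TokenCodec): string[] {
  if (maxTokens < 1 || Math.floor(maxTokens) !== maxTokens) {
    throw new RangeError("maxTokens must be a positive integer");
  }
  const tokens = codec.encode(text);
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += maxTokens) {
    // Decode each token window back into text.
    chunks.push(codec.decode(tokens.slice(start, start + maxTokens)));
  }
  return chunks;
}

// Toy stand-in codec (one token per UTF-16 code unit), NOT a real BPE tokenizer.
const charCodec: TokenCodec = {
  encode: (text) => text.split("").map((ch) => ch.charCodeAt(0)),
  decode: (tokens) => String.fromCharCode(...tokens),
};

console.log(splitByTokens("hello world", 4, charCodec));
// prints [ 'hell', 'o wo', 'rld' ]
```

In the project itself, the codec passed in would presumably be `{ encode, decode }` imported from `gpt-3-encoder`, replacing the remaining `tokenize_native` call sites. One caveat with that swap: the encoder works on byte-level BPE tokens, so cutting at an arbitrary token boundary may split a multi-byte character across two chunks.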