Change tokenizer to use gpt-3-encoder for splitting
Find this here:
https://www.npmjs.com/package/gpt-3-encoder
More info:
https://beta.openai.com/tokenizer
#3 adds gpt-3-encoder to sql2llm. References to the other tokenizer, tokenize_native (via natural.TreebankWordTokenizer()), still exist in workflows/prompt.ts and workflows/embeddings.ts.
The real question is whether we should just use this everywhere. I'm thinking yes. Can you think of any reason not to?
I agree; I think it makes sense to use this everywhere. Those two files I listed are the only other places I see any tokenizers being used in the project. (workflows/llm.ts also imports tokenizers but doesn't seem to use them.)
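
If we do standardize, one option is to route every tokenizer call through a single shared helper so that workflows/prompt.ts and workflows/embeddings.ts no longer pick their own. Below is a minimal sketch of that idea, not code from the repo: `makeTokenizer`, `countTokens`, `fitsBudget`, and `chunkTokens` are hypothetical names, and `standInEncode` is a placeholder that only mimics the shape of an encoder (string in, array of token ids out). In the project, the `encode` export of gpt-3-encoder would be passed in instead.

```typescript
// Shape of an encoder: text in, token ids out.
// gpt-3-encoder's `encode` is assumed to fit this shape:
//   import { encode } from "gpt-3-encoder";
type Encode = (text: string) => number[];

// Placeholder encoder so the sketch runs without the dependency.
// It emits one fake id per whitespace-separated word. This is NOT BPE.
const standInEncode: Encode = (text) =>
  text
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((_, index) => index);

// Hypothetical factory: builds the helpers every workflow would share.
function makeTokenizer(encode: Encode) {
  return {
    // Number of tokens the encoder produces for `text`.
    countTokens(text: string): number {
      return encode(text).length;
    },
    // True when `text` encodes to at most `maxTokens` tokens.
    fitsBudget(text: string, maxTokens: number): boolean {
      return encode(text).length <= maxTokens;
    },
    // Split the token ids of `text` into consecutive chunks of at most `size`.
    chunkTokens(text: string, size: number): number[][] {
      if (!Number.isInteger(size) || size <= 0) {
        throw new Error("size must be a positive integer");
      }
      const ids = encode(text);
      const chunks: number[][] = [];
      for (let start = 0; start < ids.length; start += size) {
        chunks.push(ids.slice(start, start + size));
      }
      return chunks;
    },
  };
}

// With the real package this would be makeTokenizer(encode).
const tokenizer = makeTokenizer(standInEncode);
```

Injecting the encoder keeps the choice of tokenizer in one place, so swapping tokenize_native out (or back in) would touch a single line rather than each workflow file.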