
Enable conditional sending of schema to OpenAI in prompt preamble

Open · asimpson opened this issue 2 years ago · 1 comment

Currently the OpenAI integration (#577) does not send any details about the user's schema or data in the prompt to OpenAI. We should explore sending the schema, or at least the database and table names, along with the prompt, which should result in more relevant KQL queries. This should be conditional, with users opted out by default. The UX is unsolved here, but maybe a checkbox in the header to include schema details works well enough?
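
As a rough sketch only (not the plugin's actual API: the includeSchema flag, the fetchTableNames helper, and the preamble wording below are all hypothetical), the preamble could be assembled conditionally like this:

async function fetchTableNames(datasource) {
  // Placeholder: in the plugin this would use the existing schema/metadata lookup.
  return ['StormEvents', 'Logs'];
}

async function buildPreamble(datasource, includeSchema) {
  // Base preamble sent with every request.
  let preamble = 'You are an assistant that writes KQL (Kusto Query Language) queries.';
  if (includeSchema) {
    // Only attach schema details when the user has checked the "include schema" box.
    const tables = await fetchTableNames(datasource);
    preamble += ` The database contains these tables: ${tables.join(', ')}.`;
  }
  return preamble;
}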

Potential issues

  1. Cost: The OpenAI API charges per 1k tokens sent (and more if the user uses GPT-4). Including the schema, or even just the table names, in the prompt potentially introduces many more tokens than the user is anticipating. We need to avoid surprise charges from API use. At the very least we should warn the user about this possibility; at best we should estimate how many tokens will be sent before the request is actually made.

  2. Token limits: The API has a maximum of 4096 tokens per request. Along with :point_up:, we should estimate the number of tokens before sending and alert the user if they've hit the limit; see the sketch after this list.
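
A minimal sketch of that pre-send check, using gpt-3-encoder for the estimate. The 4096 limit and the per-1k price below are illustrative constants, not values taken from the plugin; real limits and prices depend on the model:

const { encode } = require('gpt-3-encoder');

// Illustrative constants; actual values depend on the model the user selects.
const MAX_TOKENS = 4096;
const COST_PER_1K_TOKENS_USD = 0.002;

// Estimate tokens and cost before the request is sent, so the UI can warn the user.
function checkPrompt(prompt) {
  const tokenCount = encode(prompt).length;
  if (tokenCount > MAX_TOKENS) {
    return { ok: false, tokenCount, message: `Prompt is ${tokenCount} tokens, over the ${MAX_TOKENS} limit.` };
  }
  const estimatedCostUsd = (tokenCount / 1000) * COST_PER_1K_TOKENS_USD;
  return { ok: true, tokenCount, estimatedCostUsd };
}

console.log(checkPrompt('Write a KQL query that counts StormEvents rows per state.'));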

asimpson commented on Jun 09 '23

OpenAI recommends the gpt-3-encoder package for counting tokens, like so:

// Usage: node count-tokens.js "some prompt text"
const { encode } = require('gpt-3-encoder');

// Read the prompt text from the command line and echo it back.
const string = process.argv[2];
console.log(string);

// Encode the text into BPE tokens and report how many the prompt would consume.
const encoded = encode(string);
console.log('# of tokens: ', encoded.length);

Compare the results to those from OpenAI's tiktoken Python module:

import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4.
enc = tiktoken.get_encoding("cl100k_base")

# Encode a sample string and print how many tokens it produces.
e = enc.encode("hi there bob")
print(len(e))
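
Note that the two tools implement different encodings: gpt-3-encoder targets the GPT-3 BPE vocabulary, while cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4, so the counts for the same input may not match exactly.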

asimpson commented on Jun 14 '23