turbopilot
Investigate tokenizer behaviour to understand why it differs from the Huggingface tokenizer
If we consider the following prompt, Huggingface's tokenizer says there are 1144 tokens, whereas the 2B model's logs show 1473 tokens and the 6B model's logs show 1222 tokens. I downloaded the models from the Google Drive and have not quantized them myself. I'm not sure what causes this discrepancy.
"# Here are some relevant code fragments from other files of the repo:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/add.py\n# --------------------------------------------------\n# import subprocess\n# from typing import Tuple\n# \n# from mindflow.utils.execute import execute_no_trace\n# \n# \n# def run_add(args: Tuple[str]):\n# \"\"\"\n# Add command.\n# \"\"\"\n# command = [\"git\", \"add\"] + list(args)\n# \n# # Execute the git diff command and retrieve the output as a string\n# execute_no_trace(command)\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# return\n# \n# title, body = title_body_tuple\n# \n# command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body] # type: ignore\n# print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n# base_branch, title: Optional[str], body: Optional[str]\n# ) -> Optional[Tuple[str, str]]:\n# settings = Settings()\n# \n# diff_output = run_diff((base_branch,))\n# if not diff_output:\n# diff_output = \"\"\n# \n# title_response: Union[ModelError, str]\n# body_response: Union[ModelError, str]\n# if title is None and body is None:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# from mindflow.utils.prompts import PR_BODY_PREFIX\n# from mindflow.utils.prompts import PR_TITLE_PREFIX\n# \n# \n# def run_pr(args: Tuple[str], title: Optional[str] = None, body: Optional[str] = None):\n# base_branch = get_flag_value(args, [\"--base\", \"-B\"])\n# \n# if base_branch is None:\n# # Determine the name of the default branch\n# base_branch = (\n# subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n# .decode()\n# .strip()\n# .split(\"/\")[-1]\n# )\n# \n# if not title or not body:\n# title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n# if not title_body_tuple:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n# .decode()\n# .strip()\n# .split(\"/\")[-1]\n# )\n# \n# if not title or not body:\n# title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n# if not title_body_tuple:\n# return\n# \n# title, body = title_body_tuple\n# \n# command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body] # type: ignore\n# print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n# base_branch, title: Optional[str], body: Optional[str]\n# --------------------------------------------------\n\nfrom typing import Optional, Tuple, List\n\nfrom mindflow.core.git.pr import create_title_and_body\nfrom mindflow.utils.command_parse import get_flag_value\nfrom mindflow.utils.execute import execute_no_trace\n\n\ndef run_mr(\n args: Tuple[str], title: Optional[str] = None, description: Optional[str] = None\n):\n base_branch = get_flag_value(args, [\"--target-branch\", \"-b\"])\n\n if base_branch is None:\n # Determine the name of the default branch\n base_branch = (\n subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])"
Thanks for the ticket. I think this could be a bit of a tricky one to debug, because the GGML GPT-J tokenizer is implemented from scratch, whereas the Huggingface Codegen tokenizer also has a bunch of token merging logic which I don't think GGML's tokenizer has (I will try to confirm).
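To give a rough feel for the failure mode I have in mind, here's a toy sketch of merge-free greedy matching (an illustration only, not GGML's actual code, and the vocab is made up):

```python
# Toy illustration (not GGML's actual code): a greedy longest-match tokenizer
# with no merge ranks can split the same text into more pieces than a proper
# BPE tokenizer, which would inflate the token count.
def greedy_tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest prefix of the remainder in the vocab
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# made-up vocab: greedy matching grabs "subp" and then degrades to single characters,
# whereas a merge-aware BPE tokenizer could produce just ["sub", "process"]
vocab = {"sub", "subp", "process"}
print(greedy_tokenize("subprocess", vocab))  # ['subp', 'r', 'o', 'c', 'e', 's', 's']
```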
I can't comment on whether this is likely to significantly impact the performance of the model - that would need to be tested empirically.
Is there a specific use case you have in mind that this is blocking?
Hey, yeah, I was planning to use this for benchmarking 4-bit performance of Codegen models. Most of my prompts are 1500 tokens or more, and these overflow the 2048-token context window when tokenized incorrectly. I guess one way to get around this is to accept pretokenized inputs.
Ah OK, that makes sense, thanks for clarifying. I will look into the tokenizer behaviour properly, probably over the weekend, but in the meantime I will see if I can add a REST endpoint to the codegen server that accepts an array of tokens as a JSON list. Then you can pretokenize your input using the Huggingface tokenizer. I'll keep you posted!
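Roughly what I have in mind on the client side (a sketch only - the URL, path and JSON field names here are placeholders, nothing is implemented yet):

```python
# Client-side sketch: pretokenize with the Huggingface tokenizer and send token IDs.
# NOTE: the URL, port, path and field names below are placeholders, not a real API yet.
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")

with open("prompt.txt") as f:
    prompt = f.read()

token_ids = tokenizer(prompt)["input_ids"]

resp = requests.post(
    "http://localhost:18080/v1/engines/codegen/completions_tokens",  # placeholder path
    json={"tokens": token_ids, "max_tokens": 64},  # placeholder field names
)
print(resp.json())
```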
Thanks! I just created a PR here to allow pretokenized inputs: https://github.com/ravenscroftj/ggml/pull/2
It seems to work fine for me.
That's really cool thank you for your contribution - I have accepted the MR. I will leave this ticket open as a reminder to look into the tokenizer behaviour anyway.
Side note - I'd be really interested in your evaluation of the 4-bit model if you're willing to share it!
Thanks!
I have performed a preliminary evaluation of the 6B-4bit model on Python. I ran the model on ~2000 Python code completion scenarios (from a custom dataset) and found about a 15% degradation in the first-line exact-match metric. Here's what the graph looks like:

I manually looked at some of the mispredictions and they seemed okay to me, but they were being penalized because they weren't exact matches. I think one interesting thing to do would be to check how different the probabilities of the 16-bit and 4-bit predictions are.
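Something along these lines could quantify that (a sketch; it assumes you can get raw next-token logits out of both the 16-bit and 4-bit runs, which on the ggml side would probably need the server to expose them):

```python
# Sketch: compare the next-token distributions of two runs (e.g. fp16 vs 4-bit)
# via KL divergence. Assumes you already have both sets of logits for the same
# prompt, shaped (sequence_length, vocab_size).
import torch
import torch.nn.functional as F

def next_token_kl(logits_fp16: torch.Tensor, logits_4bit: torch.Tensor) -> float:
    """KL(fp16 || 4bit) over the next-token distribution at the last position."""
    log_p = F.log_softmax(logits_fp16[-1], dim=-1)  # reference (16-bit) distribution
    log_q = F.log_softmax(logits_4bit[-1], dim=-1)  # quantized (4-bit) distribution
    # F.kl_div expects the first argument as log-probs; log_target=True means the
    # second is also log-probs, giving sum(p * (log_p - log_q)) = KL(p || q)
    return F.kl_div(log_q, log_p, reduction="sum", log_target=True).item()
```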