
avoid allocating hats to the first letter of a token

Open josharian opened this issue 2 years ago • 7 comments

We could get much fancier than this, but after running with this for a day it appears to help some, and it is nice and simple.

I propose that we declare that it fixes #1658, at least for now.

Checklist

  • [/] I have added tests
  • [/] I have updated the docs and cheatsheet
  • [/] I have not broken the cheatsheet

josharian avatar Aug 02 '23 23:08 josharian

I plan to keep running this for a little while longer, gathering data, but I thought I would share it in case anyone else wants to play with it.

(I know the tests are busted.)

josharian avatar Aug 02 '23 23:08 josharian

here's another rev. lots of tests are still failing; it's going to be tedious to fix them, so I'd like to wait until we are relatively confident in the rest of the direction.

josharian avatar Aug 08 '23 02:08 josharian

notes to self:

  • correctly handle _abcTest (are we avoiding _ or a?)
  • perf test
  • maybe re-use tokenizers
  • switch to ranges
  • tests: stats, fixtures
  • data gathering for end users
    • no phones/replace
    • jsonl
    • open append/exclusive
    • command payload
    • rotate monthly
    • include extension version
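A rough sketch of how the data-gathering items above (jsonl, command payload, rotate monthly, include extension version) might fit together. Everything here is an assumption for illustration: the record shape, the field names, and the `hat-usage-` file prefix are hypothetical and not taken from the PR.

```typescript
// Hypothetical record shape; field names are assumptions, not the PR's format.
interface HatUsageRecord {
  timestamp: string; // ISO 8601 timestamp
  extensionVersion: string; // "include extension version"
  command: unknown; // "command payload"
}

// "rotate monthly": derive one log file name per calendar month (UTC).
function monthlyLogFileName(date: Date): string {
  const year = date.getUTCFullYear();
  const month = date.getUTCMonth() + 1;
  const mm = month < 10 ? "0" + month : String(month);
  return "hat-usage-" + year + "-" + mm + ".jsonl";
}

// "jsonl": one JSON object per line. JSON.stringify escapes any newlines
// inside string values, so each record stays on a single line.
function toJsonlLine(record: HatUsageRecord): string {
  return JSON.stringify(record) + "\n";
}
```

The actual write is left out to keep the sketch side-effect free; each line would be appended to the current month's file (for example with Node's `fs.appendFile`), which covers the "open append" item.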

josharian avatar Aug 12 '23 01:08 josharian

update: @AndreasArvidsson is going to have a look and take this one home if it's pretty close to mergeable in its current form

pokey avatar Jun 20 '24 10:06 pokey

great, thanks!

josharian avatar Jun 25 '24 00:06 josharian

@josharian Have you evaluated the difference between avoiding just the first character of the token versus the first character of every subword? When I first thought about this problem I kinda just envisioned the first character of the token, but your implementation does every subword, which could be better. Any insight?
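To make the two options concrete, here is a minimal sketch. The function names and the subword regex are assumptions for illustration, not the code in this PR or Cursorless's real tokenizer. It computes which character offsets each strategy would avoid, then orders the offsets so avoided ones are tried last.

```typescript
// Option 1: avoid only the first character of the token.
function tokenStartOffsets(token: string): number[] {
  return token.length > 0 ? [0] : [];
}

// Option 2: avoid the first character of every subword.
// Assumed subword pattern: an acronym run, a capitalized or lowercase
// word, or a digit run. Separators such as "_" are not subwords.
function subwordStartOffsets(token: string): number[] {
  const subword = /[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+/g;
  const offsets: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = subword.exec(token)) !== null) {
    offsets.push(match.index);
  }
  return offsets;
}

// Order the token's offsets so that avoided ones come last. A real
// allocator would use a penalty in its ranking rather than a hard split.
function rankHatOffsets(token: string, avoid: number[]): number[] {
  const preferred: number[] = [];
  const avoided: number[] = [];
  for (let i = 0; i < token.length; i++) {
    (avoid.indexOf(i) === -1 ? preferred : avoided).push(i);
  }
  return preferred.concat(avoided);
}
```

With these definitions, `rankHatOffsets("fooBar", tokenStartOffsets("fooBar"))` gives `[1, 2, 3, 4, 5, 0]`, while using `subwordStartOffsets` gives `[1, 2, 4, 5, 0, 3]`. For `_abcTest`, this particular regex avoids `a` and `T` (offsets 1 and 4) but not the leading `_`, which is only one possible answer to the open question in the notes above.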

AndreasArvidsson avatar Jun 25 '24 04:06 AndreasArvidsson

I remember thinking at the time that doing subwords was important. But it is not something I ever gathered data about, because the effects are purely qualitative. And a lot of time has now gone by…

josharian avatar Jun 25 '24 04:06 josharian

I just did some performance tests. Using a single editor with TypeScript, hat allocation went from about 6 ms to 8 ms. Percentage-wise that is quite a lot, but two milliseconds we can live with.

AndreasArvidsson avatar Feb 22 '25 14:02 AndreasArvidsson