cursorless icon indicating copy to clipboard operation
cursorless copied to clipboard

tokenizer: use the new dashed identifier tokenizer for Talon files

Open phillco opened this issue 3 years ago • 11 comments

Dash-separated words are valid inside of Talonscript (e.g. inside of key shortcuts), so I think they should probably use the new dashed tokenizer.

Before:

Screen Shot 2022-10-01 at 3 37 39 PM

After:

Screen Shot 2022-10-01 at 3 47 01 PM

Advantages:

You get the cleanup logic of the subword tokenizer, so you can say "check second word" on:

key(cmd-shift-2)

and it becomes:

key(cmd-2)

Or, with "chuck first word":

key(cmd-shift-2)

and it becomes:

key(shift-2)

It reduces the number of hats needed, increasing the likelihood that other tokens are addressable.

Lastly, I find that it's more intuitive for these to be read as a single token, because they are valid value types in the language. This is similar to Bash, where you can do ./foo bar-baz; in my head, bar-baz is one token. But I've learned this is not universally shared -- @AndreasArvidsson disagrees.

Disadvantages:

You'd have to say "change second word <token>" instead of just "change <other token>".

Subtraction of tokens without spaces between them (a-b) would incorrectly be marked as a single token. I think this is quite rare since most people don't subtraction within Talon files and automatic formatting is becoming more common.

phillco avatar Oct 01 '22 22:10 phillco

I don't think so. The keys are just a string argument and you can call the same action from python. actions.key("shift-p"). You can make variables in Talon script of course but they should be snake case.

AndreasArvidsson avatar Oct 02 '22 07:10 AndreasArvidsson

Hmm yeah I'm not sure these really count as tokens in Talon

Tbh they're not valid identifiers in Bash either, so I have slightly mixed feelings that we treat them as tokens in bash as well 🤔

Eg you can't do foo-bar=baz

pokey avatar Oct 02 '22 15:10 pokey

Tbh they're not valid identifiers in Bash either, so I have slightly mixed feelings that we treat them as tokens in bash as well 🤔

Sure, but they can be value types as well -- you can call cmd foo-bar, and foo-bar is a token. I don't think the definition of a token is only things that can be variables, is it?

The keys are just a string argument and you can call the same action from python.

Don't agree -- these might be strings in Python, but they are tokens in Talonscript.

phillco avatar Oct 02 '22 15:10 phillco

For the most part Talon seems to treat dashes just like python. For example:

simple parse test:
    foo=1
    bar=2
    baz=bar-foo
    insert("{baz}")

Saying "simple parse test" prints out 1

pokey avatar Oct 02 '22 15:10 pokey

Sure, but they can be value types as well -- you can call cmd foo-bar, and foo-bar is a token. I don't think the definition of a token is only things that can be variables, is it?

~I'm not sure they're really behaving like tokens, they're behaving like strings. But I think in this case the utility of referring to them as a single token probably outweighs the technicalities, so I'd be happy to leave it in bash~ sorry i'll stop bikeshedding on bash; I think https://github.com/cursorless-dev/cursorless/pull/986#issuecomment-1264672083 is the more interesting conversation

pokey avatar Oct 02 '22 15:10 pokey

Yeah, you could make an argument that key() and sleep() are really just doing the same thing inside of Talonscript.

So I guess it's just a question of tradeoffs -- how useful is it to be able to navigate within those pseudostring tokens more intuitively, versus the downsides (which so far seemed to be things not being able to identify a-b as a - b, assuming a and b are tokens. But i.m.o. people should just use spaces around subtraction, whereas we will definitely have lots of key shortcuts :P.) Are there other downsides?

phillco avatar Oct 02 '22 15:10 phillco

I think is more useful to have them as separate token. That way you can actually update the key binding without having to use a sub token. Personally I don't know why you would frequently refer to an entire keybinding and in that case we have at least three different modifiers that will do the trick.

AndreasArvidsson avatar Oct 02 '22 16:10 AndreasArvidsson

Tho fwiw one advantage of having them be subtoken is that it will do cleanup if you chuck one of them, ie removing a -

I don't feel super strongly one way or another. I agree it's better to put spaces around subtraction, but it also makes me a little bit nervous to be inconsistent with the parser here, as that could come as a surprise

pokey avatar Oct 02 '22 16:10 pokey

I definitely think that at least the regex needs to be smart enough to differentiate between subtraction and identifiers containing dash.

AndreasArvidsson avatar Oct 02 '22 16:10 AndreasArvidsson

Tho fwiw one advantage of having them be subtoken is that it will do cleanup if you chuck one of them, ie removing a -

Yep, agreed, this is nice to have, and works really well in shellscript.

That way you can actually update the key binding without having to use a sub token. Personally I don't know why you would frequently refer to an entire keybinding...

Basically, my argument is that things that look like they should have subtokens should consistently have them, so that you as the user don't have to maintain the mental state remembering which things are which in the language you're in.

That was my motivation for the Bash change, where I wanted to use "third word" on something like aa-bb-cc and it didn't work, and it took me a while to understand why it didn't work. Sure, there were hats on each of those tokens, so I should have figured it out, but once you get used to being able to use "third word" inside of a token-like thing, you expect it to work inside of all of them. I probably would have just assumed it was a bug in Cursorless if I wasn't more connected to the community and reported it in Slack, where I was helpfully corrected.

The fact that chucking words works better with the cleanup logic is just a bonus. I would also argue there's also slight positive boost to reduce unnecessary tokens.

phillco avatar Oct 02 '22 16:10 phillco

Converted to draft because there wasn't consensus yet -- https://talonvoice.slack.com/archives/C026KPTJE6T/p1664731551894769 was opened to poll users on which one they'd prefer to be the default.

https://github.com/cursorless-dev/cursorless/issues/999 was opened to make the regex configurable by language, so the user can change it if they disagree with the defaults.

phillco avatar Oct 02 '22 17:10 phillco

Replaced by #1078

AndreasArvidsson avatar Oct 24 '22 09:10 AndreasArvidsson