helix icon indicating copy to clipboard operation
helix copied to clipboard

Inconsistent quote matching behaviour with multiple selections

Open aral opened this issue 2 years ago • 8 comments

Summary

See screenshot:

This is the result of a ' on a multiple selection.

On some lines, only a single quote character is added (correct). On some, two.

(It looks like if the line ends with a dot, two single quotes are added. If there is no dot, just one.)

image

Reproduction Steps

See summary.

Helix log

n/a

Platform

Linux

Terminal Emulator

WezTerm

Helix Version

22.12-489-g786c8c8b

aral avatar May 10 '23 15:05 aral

cc @dead10ck

I skimmed the code and it looks like auto pairs doesn't actually look for pairs in case of same-char pairs (like ') and instead simply always inserts a pair for non-alphanumeric chars (which matches your example where sentences that end with . receive a closing char). I ma not sure if we can do better without syntax aware matching (so TS based). I wonder how other editors handle this

pascalkuthe avatar May 10 '23 15:05 pascalkuthe

What if count the number of "same-char" symbols in a line? If a number is odd, then put only one char.

mmaslyukov avatar May 12 '23 07:05 mmaslyukov

That would be a pretty bad heuristic and would often endup to autpairs not firing when they should.

pascalkuthe avatar May 12 '23 10:05 pascalkuthe

Yeah this is unfortunately behaving as designed currently. It intentionally does not auto pair when the preceding char is an alphanumeric in order to handle apostrophes in prose. Using TS for more language aware heuristics might help in some cases, but would be a maintenance heavy approach, since every grammar will have a different set of nodes to look for. I'm not sure what the solution is here.

dead10ck avatar May 15 '23 14:05 dead10ck

There seems to be some prior art around using TS in nvim: https://github.com/windwp/nvim-autopairs#treesitter

It seems they implemented the following rules:

  • for quotes (or really all evenly matched pairs) they use the rule proposed by @mmaslyukov above, I might have been a bit too quick to dismiss that. I played around with it a bit and it actually works better than I expected. Although their handling is a bit more complex (for example you need to somehow handle liftimes 'foo for rust, they do this with a simple regex that matches [^&]' but that's not a great solution since &'f' is valid rust and represents a char reference).
  • slightly smarter regex patterns. These are somewhat language specific tough. (see https://github.com/windwp/nvim-autopairs/blob/7747bbae60074acf0b9e3a4c13950be7a2dff444/lua/nvim-autopairs/rules/basic.lua#L39)
  • they simply specify a list of TS nodes inside of which autoparis are disabled. These are typically the string literal node of the corresponding language. This works pretty well because an unterminated string literal usually spans the entire file (or at least until the next ").

So my idea for solving this would be:

  • Add a language-specific config option that contains a list of nodes inside which auto pairs are disabled. This should not be too much work to maintain since its usually just string or string-literal
  • Add the heuristic suggested by @mmaslyukov but with one twist: also ignore quotes inside the previously specified TS scopes. This would allow us to specify liftime and reference_type as an ignored scope for rust (and make ' pairs work as a result)

pascalkuthe avatar May 15 '23 15:05 pascalkuthe

The problem with TS is that it works very inconsistently depending upon whether it can recognize the node you're starting to type when it's incomplete. Specifying to ignore auto pairs inside line-wise comments or an already-complete string literal would work fine, but not for lifetime nodes. For example, if we wanted to type this line:

let foo: &'static str = "foo";

The complete line parses to:

(let_declaration
   pattern: (identifier)
   type: (reference_type
     (lifetime
       (identifier))
     type: (primitive_type))
   value: (string_literal))

But as we're typing it, e.g. once we get here:

let foo: &'

It does not recognize that we're starting a lifetime specifier

  (ERROR
    (identifier))

We should also consider that this problem also affects plaintext files, as in the example given in this issue, where arguably the omission of the pair is more frequently desired than it is in code, since plaintext is more often prose.

Moreover, I suspect the simple even/odd check will not work well in cases of plaintext prose since any apostrophes will throw off the count, and the result would be that sometimes it omits the auto pair and sometimes it doesn't, seemingly at random depending on how many apostrophes are used within that sentence.

dead10ck avatar May 15 '23 15:05 dead10ck

Hmm yeah for plaintext I think we can't do much better than we are currently doing. Altough the even/odd matching could still be used for double quotes since those are not really used like an apastrope. ' is just overused I gues :D

For programming languages tree sitter could just be added as an enhancement (basically do better if we can). Regarding error nodes TS actually gets this right most of the time:

fn bar(){
    let foo : &
    let foo = 2;
}

here & does have the reference_type node. It only doesn't work for the last let binding in a function:

fn bar(){
    let foo = 2;
    let foo : &
}

we don't have an auto pair for ' in rust at all right now so we could just leave it disabled. But for string literals, this tactic should work really well tough since those will greedily parse all text following them.

pascalkuthe avatar May 15 '23 16:05 pascalkuthe

Oh interesting. Yeah agreed, I think it makes sense to use TS where it can concretely improve the situation, like comments.

But yeah, I share your fear that we can't really do much better for plaintext than we do now. There's simply not enough information to divine when a single quote should be paired and when it's just an apostrophe.

dead10ck avatar May 15 '23 20:05 dead10ck

I think its fair to say that https://github.com/helix-editor/helix/pull/8294 closes this. There are cases where having a plaintext version is useful so I would like to keep the current mi" unchanegd.

For most cases you will be able to use mam/mim with #8294 this will work both on and in quotes. It will also be unambigous since there are usually not other TS nodes inside string literals (so there wouldn't be brackets that would be selected first instead for example)

pascalkuthe avatar Apr 14 '24 01:04 pascalkuthe