sql-highlight icon indicating copy to clipboard operation
sql-highlight copied to clipboard

improve operator detection

Open wkeese opened this issue 2 years ago • 10 comments
trafficstars

There are multiple multi-character operators in SQL such as >=, <=, <>, and maybe others. Currently, each of them except for != is split into multiple segments, for example:

{ name: 'special', content: '>' },
{ name: 'special', content: '=' },

This precludes some (likely uncommon) HTML formatting, such as drawing a box around each operator.

Ideally, these multi-character operators would be classified as a single segment, i.e.

{ name: 'special', content: '>=' },

For example (if building on top of my previous PR #148):

it('operators', () => {
  expect(getSegments('AGE >= 45'))
    .toStrictEqual([
      { name: 'identifier', content: 'AGE' },
      { name: 'whitespace', content: ' ' }
      { name: 'special', content: '>=' },
      { name: 'whitespace', content: ' ' },
      { name: 'number', content: '45' },
    ])
})

The current regexp for "special" segments is:

/(?<special>!=|[=%*/\-+,;:<>])/

Besides the problem/inefficiency I listed above, another problem with that regexp is that it misses some single-character operators, for example & and |. See https://www.w3schools.com/sql/sql_operators.asp for a list.

For those two reasons, I wonder if instead of the current regexp, we should just look for all sequences of punctuation, excluding the quotation marks that start strings and identifiers:

/(?<special>[^\w\s"'`]+)/

Admittedly as I wrote in #149 it's sometimes hard to differentiate between subtraction and negative numbers, and the regexp above would exacerbate that problem in corner cases like AGE*-5. It would work fine with spaces though, i.e. AGE * -5.

wkeese avatar Oct 05 '23 02:10 wkeese

This issue has been marked as stale because it has been open for 14 days with no activity. Remove the stale label or comment or this will be closed in 5 days.

scriptsbot avatar Oct 20 '23 02:10 scriptsbot

Ping.

wkeese avatar Oct 20 '23 02:10 wkeese

Hi, i have the same problem

another problem with that regexp is that it misses some single-character operators, for example & and |. See https://www.w3schools.com/sql/sql_operators.asp for a list.

Quick solution, add &|^ to regex for "special" segments

/(?<special>!=|[=%*/\-+,;:<>&\|\^])/

parallels999 avatar Jan 23 '24 21:01 parallels999

Hey, sorry for not replying here. Simply put I haven't had that much time the last few weeks due to various reasons. But next week is vacation, maybe I'll have some time then :smile: No promises though

scriptcoded avatar Feb 01 '24 11:02 scriptcoded

Quick solution, add &|^ to regex for "special" segments

/(?<special>!=|[=%*/\-+,;:<>&\|\^])/

That works for me, but could you do a PR?

PaolaRuby avatar Mar 11 '24 17:03 PaolaRuby

Wait, that doesn't address the main problem I listed in this issue. Why not just go with what I originally suggested?

/(?<special>[^\w\s"'`]+)/

wkeese avatar Mar 12 '24 00:03 wkeese

As @wkeese mentioned in the description a fix for this issue might (or more likely will) affect #149. I'm very keen on resolving the number detection issue, and I think that coming up with a good solution to that issue first would be the better approach here. Then we could look at how to adapt the matching logic introduced there to work with multi character operators.

To add my five cents to the conversation regarding different regexes, without having actually tested these myself I believe implementing the more generic regex suggested by @wkeese would generally be a better approach since it takes away a bit of the burden maintaining that character list. That being there might be a risk it's a bit too greedy, so the tests will have to tell if it will actually work.

scriptcoded avatar Mar 12 '24 08:03 scriptcoded

I believe implementing the more generic regex suggested by @wkeese would generally be a better approach since it takes away a bit of the burden maintaining that character list. That being there might be a risk it's a bit too greedy, so the tests will have to tell if it will actually work.

You made a lot of good points.

ISTM the root issue, probably impractical to solve, is that + and - can be unary or binary operators. So regardless of whether we have an explicit list of operators, this can be tokenized two ways:

x<>-5

And so can this:

x-5

But those are edge cases due to the lack of spaces.

I do think it makes sense for number matching to take priority as that will do the right thing for common expressions like

x > -5

wkeese avatar Mar 30 '24 05:03 wkeese

PS: I added a PR #149 for the numbers. I points to the beta branch since presumably the master branch is now just for bug fixes.

wkeese avatar Apr 02 '24 08:04 wkeese

I added #193, now we match all operators. Actually, it's using "special" as the bucket for anything that doesn't match the other regexps (number, identifier, string, etc.). But I added lots of tests and I've convinced myself that it isn't too greedy. In particular, that minus, plus, or dot before a digit is treated as part of the number.

wkeese avatar Apr 13 '24 23:04 wkeese

The fix for this in #193 has been merged to the beta branch and will be released very soon. I'll close this a bit prematurely, but feel free to reopen if you feel it's not fixed.

scriptcoded avatar Jul 02 '24 20:07 scriptcoded

:tada: This issue has been resolved in version 6.0.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

scriptsbot avatar Jul 02 '24 21:07 scriptsbot