sql-highlight
sql-highlight copied to clipboard
improve operator detection
There are multiple multi-character operators in SQL such as >=, <=, <>, and maybe others. Currently, each of them except for != is split into multiple segments, for example:
{ name: 'special', content: '>' },
{ name: 'special', content: '=' },
This precludes some (likely uncommon) HTML formatting, such as drawing a box around each operator.
Ideally, these multi-character operators would be classified as a single segment, i.e.
{ name: 'special', content: '>=' },
For example (if building on top of my previous PR #148):
it('operators', () => {
expect(getSegments('AGE >= 45'))
.toStrictEqual([
{ name: 'identifier', content: 'AGE' },
{ name: 'whitespace', content: ' ' }
{ name: 'special', content: '>=' },
{ name: 'whitespace', content: ' ' },
{ name: 'number', content: '45' },
])
})
The current regexp for "special" segments is:
/(?<special>!=|[=%*/\-+,;:<>])/
Besides the problem/inefficiency I listed above, another problem with that regexp is that it misses some single-character operators, for example & and |. See https://www.w3schools.com/sql/sql_operators.asp for a list.
For those two reasons, I wonder if instead of the current regexp, we should just look for all sequences of punctuation, excluding the quotation marks that start strings and identifiers:
/(?<special>[^\w\s"'`]+)/
Admittedly as I wrote in #149 it's sometimes hard to differentiate between subtraction and negative numbers, and the regexp above would exacerbate that problem in corner cases like AGE*-5. It would work fine with spaces though, i.e. AGE * -5.
This issue has been marked as stale because it has been open for 14 days with no activity. Remove the stale label or comment or this will be closed in 5 days.
Ping.
Hi, i have the same problem
another problem with that regexp is that it misses some single-character operators, for example
&and|. See https://www.w3schools.com/sql/sql_operators.asp for a list.
Quick solution, add &|^ to regex for "special" segments
/(?<special>!=|[=%*/\-+,;:<>&\|\^])/
Hey, sorry for not replying here. Simply put I haven't had that much time the last few weeks due to various reasons. But next week is vacation, maybe I'll have some time then :smile: No promises though
Quick solution, add &|^ to regex for "special" segments
/(?<special>!=|[=%*/\-+,;:<>&\|\^])/
That works for me, but could you do a PR?
Wait, that doesn't address the main problem I listed in this issue. Why not just go with what I originally suggested?
/(?<special>[^\w\s"'`]+)/
As @wkeese mentioned in the description a fix for this issue might (or more likely will) affect #149. I'm very keen on resolving the number detection issue, and I think that coming up with a good solution to that issue first would be the better approach here. Then we could look at how to adapt the matching logic introduced there to work with multi character operators.
To add my five cents to the conversation regarding different regexes, without having actually tested these myself I believe implementing the more generic regex suggested by @wkeese would generally be a better approach since it takes away a bit of the burden maintaining that character list. That being there might be a risk it's a bit too greedy, so the tests will have to tell if it will actually work.
I believe implementing the more generic regex suggested by @wkeese would generally be a better approach since it takes away a bit of the burden maintaining that character list. That being there might be a risk it's a bit too greedy, so the tests will have to tell if it will actually work.
You made a lot of good points.
ISTM the root issue, probably impractical to solve, is that + and - can be unary or binary operators. So regardless of whether we have an explicit list of operators, this can be tokenized two ways:
x<>-5
And so can this:
x-5
But those are edge cases due to the lack of spaces.
I do think it makes sense for number matching to take priority as that will do the right thing for common expressions like
x > -5
PS: I added a PR #149 for the numbers. I points to the beta branch since presumably the master branch is now just for bug fixes.
I added #193, now we match all operators. Actually, it's using "special" as the bucket for anything that doesn't match the other regexps (number, identifier, string, etc.). But I added lots of tests and I've convinced myself that it isn't too greedy. In particular, that minus, plus, or dot before a digit is treated as part of the number.
The fix for this in #193 has been merged to the beta branch and will be released very soon. I'll close this a bit prematurely, but feel free to reopen if you feel it's not fixed.
:tada: This issue has been resolved in version 6.0.0 :tada:
The release is available on:
Your semantic-release bot :package::rocket: