highlight.js
highlight.js copied to clipboard
(Haskell) Quotes aren't escaped correctly
Describe the issue
Placing a quotation mark inside of single quotes thats escaped, '\"'
Which language seems to have the issue? haskell
Are you using highlight
or highlightAuto
?
Unsure, it's in discord, im assuming highlight
...
Sample Code to Reproduce
char '\"' etc etc
Expected behavior
char '\"' etc etc
Additional context
Our Haskell grammar only includes double quoted strings, there is currently no mention of single quoted strings ever being a possibility.
- Could you point to some docs on this?
- Any chance you might be willing to work on a PR?
The single-quoted things are character literals, not string literals. See the Haskell report for their syntax. Both character and string literals support a number of escape sequences not commonly found in other languages. You could probably ignore all of those for highlighting in most cases …
However, note that normal identifiers may also contain single-quotes (e.g. a'b'
is a single identifier, not identifier a
+ char literal), and certain extensions (for which the GHC docs may need to be consulted) also make use of single-quotes prefixed onto identifiers (e.g. 'not
and ''Bool
name-quoting of value and type), so it’s important to note that while 'a'
is always a char literal, 'ab'
is a name-quotation of the identifier ab'
. 'a'b
is a character literal followed by identifier b
, not a name-quotation of identifier a'b
.
Single-quotes in Haskell are commonly mishandled by highlighters because of this complexity.
Single-quotes in Haskell are commonly mishandled by highlighters because of this complexity.
Can they [all these different cases] be detected unambiguously with forward-looking regex only, or would we need to be able to look backwards? If the cases are all distinct and can be detected by regex (without surrounding context) then we should be able to get it right.
I think we'd need someone who knows Haskell really well willing to work on a PR and I'm happy to answer questions and help from this end.
Is whitespace significant? Can we assume a character literal is always proceeded by a space? IE [space]'[char]'
?
While a char literal may be preceded by an operator or other syntactic entities without spaces, it seems it cannot be preceded without space by a regular identifier or alphabetic keyword – its opening single quote will be munched as part of the identifier, regardless of what follows.
Given that, I think purely looking forward works, as long as each of the cases are identified and tried in the right order. For example, in an example like a'+'b'
, firstly you find the identifier a'
, then operator +
, and finally a char literal 'b'
. With a space to terminate the first identifier, a '+'b'
instead consists of an identifier a
, a char literal '+'
, and another identifier b'
.
So it sounds like if we just consume identifiers first (to prevent '
inside of them from triggering character literals)... then consume character literals... (our matcher engine does match regex in sequence)
-
a'b'c'd'e'
(match identifiers) -
'b'
(match character literal) (knowing the most popular escapes would help) - IF we care about name quotations
'ab'
then stuff like'not
and''Bool
might be problematic unless there is a list or we can terminate them a known demarcation (whitespace, known list of operators, etc).
The big worry is hitting a '
and getting stuck looking for a '
that we will never find... I think this would only be an issue if we try to match name quotations (which I'm not sure we need to?)
@Melvar Any chance you'd like to try your hand at a PR?
To perhaps clarify, a name quotation is one or two single quotes immediately before an identifier. In the case of 'ab'
, the name quoted is ab'
, that is, the second '
is part of the identifier. Simply trying to match a char before trying a name-quotation will handle that, as long as the char matcher is accurate enough (a char literal must contain a single printable character except '
or \
, or a single escape sequence, between its single quotes).
Our Haskell grammar only includes double quoted strings, there is currently no mention of single quoted strings ever being a possibility.
* Could you point to some docs on this? * Any chance you might be willing to work on a PR?
I would be willing to, once I do some research into how parsing is done at all, what part of the grammar is parsing the single quoted chars in the first place Is this the line? https://github.com/highlightjs/highlight.js/blob/1a258a3e72dce8b6534e28a741cfcd7d813463dd/src/languages/haskell.js#L195
No, that's double quoted. https://github.com/highlightjs/highlight.js/blob/1a258a3e72dce8b6534e28a741cfcd7d813463dd/src/lib/modes.js#L51
We match single quotes in only 3 places I see:
-
CONSTRUCTOR
- inherit from
TITLE_MODE
in two places
The latter is probably already intended to prevent this but I'd wager the regex isn't sufficient enough or it's not being deployed in all the correct contexts.
If you were confused on the grammar itself but could even contribute just a list of tests cases and desired behavior for them that we could add to our test suite that'd be a start:
-- character literal
'a'
-- something else, etc...
a'b'c
To handle JUST character literals though (as this issue was opened for) we'd just need a rule with a solid regex that could match characters:
{
scope: "string.char",
match: // ... single regex
}
Or better to detect char.escape
also:
{
scope: "string",
begin: /'/, // perhaps also a look-ahead to prevent false positives
end: /'/,
contains: [
{ scope: "char.escape", match: /\/./ },
]
}
@AusCyberman Still interested in helping out? trying to figure out what we might be able to get into the 11.5 release.
Hi! I took a shot at this issue with #3723.