highlight.js icon indicating copy to clipboard operation
highlight.js copied to clipboard

(Haskell) Quotes aren't escaped correctly

Open auscyber opened this issue 3 years ago • 10 comments

Describe the issue Placing a quotation mark inside of single quotes thats escaped, '\"'

Which language seems to have the issue? haskell

Are you using highlight or highlightAuto? Unsure, it's in discord, im assuming highlight ...

Sample Code to Reproduce

char '\"' etc etc

Expected behavior

char '\"' etc etc

Additional context

auscyber avatar Oct 10 '21 23:10 auscyber

Our Haskell grammar only includes double quoted strings, there is currently no mention of single quoted strings ever being a possibility.

  • Could you point to some docs on this?
  • Any chance you might be willing to work on a PR?

joshgoebel avatar Oct 11 '21 05:10 joshgoebel

The single-quoted things are character literals, not string literals. See the Haskell report for their syntax. Both character and string literals support a number of escape sequences not commonly found in other languages. You could probably ignore all of those for highlighting in most cases … However, note that normal identifiers may also contain single-quotes (e.g. a'b' is a single identifier, not identifier a + char literal), and certain extensions (for which the GHC docs may need to be consulted) also make use of single-quotes prefixed onto identifiers (e.g. 'not and ''Bool name-quoting of value and type), so it’s important to note that while 'a' is always a char literal, 'ab' is a name-quotation of the identifier ab'. 'a'b is a character literal followed by identifier b, not a name-quotation of identifier a'b. Single-quotes in Haskell are commonly mishandled by highlighters because of this complexity.

Melvar avatar Oct 11 '21 09:10 Melvar

Single-quotes in Haskell are commonly mishandled by highlighters because of this complexity.

Can they [all these different cases] be detected unambiguously with forward-looking regex only, or would we need to be able to look backwards? If the cases are all distinct and can be detected by regex (without surrounding context) then we should be able to get it right.

I think we'd need someone who knows Haskell really well willing to work on a PR and I'm happy to answer questions and help from this end.

Is whitespace significant? Can we assume a character literal is always proceeded by a space? IE [space]'[char]'?

joshgoebel avatar Oct 11 '21 14:10 joshgoebel

While a char literal may be preceded by an operator or other syntactic entities without spaces, it seems it cannot be preceded without space by a regular identifier or alphabetic keyword – its opening single quote will be munched as part of the identifier, regardless of what follows. Given that, I think purely looking forward works, as long as each of the cases are identified and tried in the right order. For example, in an example like a'+'b', firstly you find the identifier a', then operator +, and finally a char literal 'b'. With a space to terminate the first identifier, a '+'b' instead consists of an identifier a, a char literal '+', and another identifier b'.

Melvar avatar Oct 11 '21 14:10 Melvar

So it sounds like if we just consume identifiers first (to prevent ' inside of them from triggering character literals)... then consume character literals... (our matcher engine does match regex in sequence)

  • a'b'c'd'e' (match identifiers)
  • 'b' (match character literal) (knowing the most popular escapes would help)
  • IF we care about name quotations 'ab' then stuff like 'not and ''Bool might be problematic unless there is a list or we can terminate them a known demarcation (whitespace, known list of operators, etc).

The big worry is hitting a ' and getting stuck looking for a ' that we will never find... I think this would only be an issue if we try to match name quotations (which I'm not sure we need to?)

@Melvar Any chance you'd like to try your hand at a PR?

joshgoebel avatar Oct 11 '21 15:10 joshgoebel

To perhaps clarify, a name quotation is one or two single quotes immediately before an identifier. In the case of 'ab', the name quoted is ab', that is, the second ' is part of the identifier. Simply trying to match a char before trying a name-quotation will handle that, as long as the char matcher is accurate enough (a char literal must contain a single printable character except ' or \, or a single escape sequence, between its single quotes).

Melvar avatar Oct 11 '21 15:10 Melvar

Our Haskell grammar only includes double quoted strings, there is currently no mention of single quoted strings ever being a possibility.

* Could you point to some docs on this?

* Any chance you might be willing to work on a PR?

I would be willing to, once I do some research into how parsing is done at all, what part of the grammar is parsing the single quoted chars in the first place Is this the line? https://github.com/highlightjs/highlight.js/blob/1a258a3e72dce8b6534e28a741cfcd7d813463dd/src/languages/haskell.js#L195

auscyber avatar Oct 11 '21 23:10 auscyber

No, that's double quoted. https://github.com/highlightjs/highlight.js/blob/1a258a3e72dce8b6534e28a741cfcd7d813463dd/src/lib/modes.js#L51

We match single quotes in only 3 places I see:

  • CONSTRUCTOR
  • inherit from TITLE_MODE in two places

The latter is probably already intended to prevent this but I'd wager the regex isn't sufficient enough or it's not being deployed in all the correct contexts.


If you were confused on the grammar itself but could even contribute just a list of tests cases and desired behavior for them that we could add to our test suite that'd be a start:

-- character literal
'a'
-- something else, etc...
a'b'c

joshgoebel avatar Oct 12 '21 11:10 joshgoebel

To handle JUST character literals though (as this issue was opened for) we'd just need a rule with a solid regex that could match characters:

{
  scope: "string.char",
  match: // ... single regex
}

Or better to detect char.escape also:

{
  scope: "string",
  begin: /'/, // perhaps also a look-ahead to prevent false positives
  end: /'/,
  contains: [
    { scope: "char.escape", match: /\/./ },
  ]
}

joshgoebel avatar Oct 12 '21 11:10 joshgoebel

@AusCyberman Still interested in helping out? trying to figure out what we might be able to get into the 11.5 release.

joshgoebel avatar Mar 01 '22 12:03 joshgoebel

Hi! I took a shot at this issue with #3723.

CrystalSplitter avatar Feb 28 '23 08:02 CrystalSplitter