skylighting icon indicating copy to clipboard operation
skylighting copied to clipboard

Haskell tokenization incorrectly splits backslash operators and lambda expressions

Open mgajda opened this issue 2 months ago • 4 comments

The Haskell syntax definition in Skylighting incorrectly tokenizes operators containing backslashes and lambda expressions, producing different results compared to GHC's own tokenizer.

Issues

1. Lambda Expressions (\x -> ...)

Skylighting lumps the lambda backslash and variable together as a single token:

\x -> x
  • GHC tokenizer: [λ, x, ->, x] (converts \ to Unicode lambda)
  • Skylighting: ["\x ", "->", " x"] (lumps \x together)

2. Set Difference Operator (\\)

Entire expression lumped into one token:

a \\ b
  • GHC tokenizer: [a, \\, b]
  • Skylighting: ["a \\\\ b"]

3. Logical OR Operator (\/)

Splits inconsistently, attaching backslash to preceding text:

a \/ b
  • GHC tokenizer: [a, \/, b]
  • Skylighting: ["a \\", "/", " b"]

4. Logical AND Operator (/\)

Splits inconsistently, attaching backslash to following text:

a /\ b
  • GHC tokenizer: [a, /\, b]
  • Skylighting: ["a ", "/", "\\ b"]

5. Custom Backslash Operators (\+, \>, etc.)

Sometimes splits instead of treating as atomic operators:

\+ 1
\> x
  • GHC tokenizer: [\+, 1] and [\>, x]
  • Skylighting: ["\\", "+", " ", 1] and ["\\", ">", " x"]

Expected Behavior

The Haskell syntax definition should recognize:

  1. Lambda expressions: \ followed by an identifier should be separate tokens
  2. Backslash operators: \\, \/, /\, and custom operators like \+, \> should be atomic operator tokens
  3. These should match GHC's tokenization behavior

Impact

This affects any tool using Skylighting for Haskell syntax highlighting/tokenization, including Pandoc, static site generators, documentation tools, and code formatters.

mgajda avatar Nov 17 '25 02:11 mgajda

Demonstration Code

Here's working Haskell code that demonstrates Skylighting's tokenization issues:

{-# LANGUAGE OverloadedStrings #-}
module Main where

import Skylighting.Core
import qualified Data.Text as T
import Data.Text (Text)
import Data.Maybe (fromMaybe)

-- Test cases demonstrating the issue
testCases :: [(String, Text)]
testCases =
  [ ("Lambda", "\\x -> x")
  , ("Set difference", "a \\\\ b")
  , ("Logical OR", "a \\/ b")
  , ("Logical AND", "a /\\ b")
  , ("Custom op \\+", "\\+ 1")
  , ("Custom op \\>", "\\> x")
  ]

-- Pretty print a token
showToken :: Token -> String
showToken (tokType, text) = "\"" ++ T.unpack text ++ "\""

-- Main
main :: IO ()
main = do
  result <- loadSyntaxesFromDir "xml/"
  sMap <- case result of
    Left e -> error $ "Error loading syntax definitions: " ++ e
    Right m -> return m

  let syntax = fromMaybe (error "Haskell syntax not found") $
               lookupSyntax "haskell" sMap
      config = TokenizerConfig { syntaxMap = sMap, traceOutput = False }

  putStrLn "=== Skylighting Haskell Tokenization Issues ===\n"
  mapM_ (testCase config syntax) testCases

testCase :: TokenizerConfig -> Syntax -> (String, Text) -> IO ()
testCase config syntax (name, input) = do
  putStrLn $ "Test: " ++ name
  putStrLn $ "Input: " ++ show input
  case tokenize config syntax input of
    Right lines -> do
      let tokens = concat lines
      putStrLn $ "Skylighting tokens: [" ++
        (concat $ map (++ ", ") $ map showToken $ init tokens) ++
        showToken (last tokens) ++ "]"
    Left err -> putStrLn $ "Error: " ++ err
  putStrLn ""

How to run

  1. Clone the skylighting repository
  2. Save the code above in the repository root as test-backslash.hs
  3. Build and run:
cd skylighting-core
cabal build
cabal exec -- ghc -package skylighting-core ../test-backslash.hs -o ../test-backslash
cd ..
./test-backslash

Actual Output

=== Skylighting Haskell Tokenization Issues ===

Test: Lambda
Input: "\\x -> x"
Skylighting tokens: ["\x ", "->", " x"]

Test: Set difference
Input: "a \\\\ b"
Skylighting tokens: ["a \\ b"]

Test: Logical OR
Input: "a \\/ b"
Skylighting tokens: ["a \", "/", " b"]

Test: Logical AND
Input: "a /\\ b"
Skylighting tokens: ["a ", "/", "\ b"]

Test: Custom op \+
Input: "\\+ 1"
Skylighting tokens: ["\", "+", " ", "1"]

Test: Custom op \>
Input: "\\> x"
Skylighting tokens: ["\", ">", " x"]

Expected Behavior

For comparison, in valid Haskell these should tokenize as:

  • Lambda \x -> x["\\", "x", "->", "x"] (backslash separate from variable)
  • Set difference a \\ b["a", "\\\\", "b"] (double backslash as single operator)
  • Logical OR a \/ b["a", "\\/", "b"] (backslash-slash as single operator)
  • Logical AND a /\ b["a", "/\\", "b"] (slash-backslash as single operator)
  • Custom op \+["\\+", "1"] (backslash-plus as single operator)
  • Custom op \>["\\>", "x"] (backslash-greater as single operator)

The core issue is that backslash operators are not recognized as atomic tokens in the Haskell syntax definition.

mgajda avatar Nov 17 '25 02:11 mgajda

You can examine tokenization easily using the command-line tool.

% skylighting --syntax haskell --format native
\x -> x
[ [ ( NormalTok , "\\x " )
  , ( OtherTok , "->" )
  , ( NormalTok , " x" )
  ]
]

Note that our highlighters are derived from KDE syntax specifications. I verified using Kate editor that this is just how the KDE syntax specification for Haskell seems to work -- it doesn't give any special tokenization to lambdas. The issue would have to be addressed by modifications to haskell.xml (and submitted upstream). Let us know here if you do that. As far as I can see, there is no bug in his library, which just faithfully tries to interpreted the KDE syntax specifications.

jgm avatar Nov 17 '25 08:11 jgm

Thanks, I found it:

Root Cause

Two issues in haskell.xml:

  1. Line 476 (code context): The operator regex doesn't properly handle backslash's special role in Haskell operators. Backslash operators are not being matched atomically.

  2. Lines 542-549 (import context): The import context completely lacks operator matching rules, so operators in import lists like import Data.List (\\) are not properly tokenized at all.

Fix

A fix has been submitted upstream to the KDE syntax-highlighting repository:

  • Bug report: https://bugs.kde.org/show_bug.cgi?id=512318
  • Merge request: https://invent.kde.org/frameworks/syntax-highlighting/-/merge_requests/760

The fix adds:

  • Explicit backslash operator regex in code context (before general operator pattern)
  • Operator matching rules (both backslash and general) to import context
  • Test cases for lambda expressions and backslash operators
  • Bumped syntax version from 21 to 22

Once merged upstream and released, Skylighting will pull the updated definitions in a future release.

mgajda avatar Nov 19 '25 09:11 mgajda

Update: Fix submitted upstream and CI passes ✅

  • Upstream bug: https://bugs.kde.org/show_bug.cgi?id=512318
  • Merge request: https://invent.kde.org/frameworks/syntax-highlighting/-/merge_requests/760
  • CI status: All 9 jobs passed

The fix is ready for review by KDE maintainers.

mgajda avatar Nov 19 '25 19:11 mgajda