Haskell tokenization incorrectly splits backslash operators and lambda expressions
The Haskell syntax definition in Skylighting incorrectly tokenizes operators containing backslashes and lambda expressions, producing different results compared to GHC's own tokenizer.
Issues
1. Lambda Expressions (\x -> ...)
Skylighting lumps the lambda backslash and variable together as a single token:
\x -> x
-
GHC tokenizer:
[λ, x, ->, x](converts\to Unicode lambda) -
Skylighting:
["\x ", "->", " x"](lumps\xtogether)
2. Set Difference Operator (\\)
Entire expression lumped into one token:
a \\ b
-
GHC tokenizer:
[a, \\, b] -
Skylighting:
["a \\\\ b"]
3. Logical OR Operator (\/)
Splits inconsistently, attaching backslash to preceding text:
a \/ b
-
GHC tokenizer:
[a, \/, b] -
Skylighting:
["a \\", "/", " b"]
4. Logical AND Operator (/\)
Splits inconsistently, attaching backslash to following text:
a /\ b
-
GHC tokenizer:
[a, /\, b] -
Skylighting:
["a ", "/", "\\ b"]
5. Custom Backslash Operators (\+, \>, etc.)
Sometimes splits instead of treating as atomic operators:
\+ 1
\> x
-
GHC tokenizer:
[\+, 1]and[\>, x] -
Skylighting:
["\\", "+", " ", 1]and["\\", ">", " x"]
Expected Behavior
The Haskell syntax definition should recognize:
- Lambda expressions:
\followed by an identifier should be separate tokens - Backslash operators:
\\,\/,/\, and custom operators like\+,\>should be atomic operator tokens - These should match GHC's tokenization behavior
Impact
This affects any tool using Skylighting for Haskell syntax highlighting/tokenization, including Pandoc, static site generators, documentation tools, and code formatters.
Demonstration Code
Here's working Haskell code that demonstrates Skylighting's tokenization issues:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Skylighting.Core
import qualified Data.Text as T
import Data.Text (Text)
import Data.Maybe (fromMaybe)
-- Test cases demonstrating the issue
testCases :: [(String, Text)]
testCases =
[ ("Lambda", "\\x -> x")
, ("Set difference", "a \\\\ b")
, ("Logical OR", "a \\/ b")
, ("Logical AND", "a /\\ b")
, ("Custom op \\+", "\\+ 1")
, ("Custom op \\>", "\\> x")
]
-- Pretty print a token
showToken :: Token -> String
showToken (tokType, text) = "\"" ++ T.unpack text ++ "\""
-- Main
main :: IO ()
main = do
result <- loadSyntaxesFromDir "xml/"
sMap <- case result of
Left e -> error $ "Error loading syntax definitions: " ++ e
Right m -> return m
let syntax = fromMaybe (error "Haskell syntax not found") $
lookupSyntax "haskell" sMap
config = TokenizerConfig { syntaxMap = sMap, traceOutput = False }
putStrLn "=== Skylighting Haskell Tokenization Issues ===\n"
mapM_ (testCase config syntax) testCases
testCase :: TokenizerConfig -> Syntax -> (String, Text) -> IO ()
testCase config syntax (name, input) = do
putStrLn $ "Test: " ++ name
putStrLn $ "Input: " ++ show input
case tokenize config syntax input of
Right lines -> do
let tokens = concat lines
putStrLn $ "Skylighting tokens: [" ++
(concat $ map (++ ", ") $ map showToken $ init tokens) ++
showToken (last tokens) ++ "]"
Left err -> putStrLn $ "Error: " ++ err
putStrLn ""
How to run
- Clone the skylighting repository
- Save the code above in the repository root as
test-backslash.hs - Build and run:
cd skylighting-core
cabal build
cabal exec -- ghc -package skylighting-core ../test-backslash.hs -o ../test-backslash
cd ..
./test-backslash
Actual Output
=== Skylighting Haskell Tokenization Issues ===
Test: Lambda
Input: "\\x -> x"
Skylighting tokens: ["\x ", "->", " x"]
Test: Set difference
Input: "a \\\\ b"
Skylighting tokens: ["a \\ b"]
Test: Logical OR
Input: "a \\/ b"
Skylighting tokens: ["a \", "/", " b"]
Test: Logical AND
Input: "a /\\ b"
Skylighting tokens: ["a ", "/", "\ b"]
Test: Custom op \+
Input: "\\+ 1"
Skylighting tokens: ["\", "+", " ", "1"]
Test: Custom op \>
Input: "\\> x"
Skylighting tokens: ["\", ">", " x"]
Expected Behavior
For comparison, in valid Haskell these should tokenize as:
-
Lambda
\x -> x→["\\", "x", "->", "x"](backslash separate from variable) -
Set difference
a \\ b→["a", "\\\\", "b"](double backslash as single operator) -
Logical OR
a \/ b→["a", "\\/", "b"](backslash-slash as single operator) -
Logical AND
a /\ b→["a", "/\\", "b"](slash-backslash as single operator) -
Custom op
\+→["\\+", "1"](backslash-plus as single operator) -
Custom op
\>→["\\>", "x"](backslash-greater as single operator)
The core issue is that backslash operators are not recognized as atomic tokens in the Haskell syntax definition.
You can examine tokenization easily using the command-line tool.
% skylighting --syntax haskell --format native
\x -> x
[ [ ( NormalTok , "\\x " )
, ( OtherTok , "->" )
, ( NormalTok , " x" )
]
]
Note that our highlighters are derived from KDE syntax specifications. I verified using Kate editor that this is just how the KDE syntax specification for Haskell seems to work -- it doesn't give any special tokenization to lambdas. The issue would have to be addressed by modifications to haskell.xml (and submitted upstream). Let us know here if you do that. As far as I can see, there is no bug in his library, which just faithfully tries to interpreted the KDE syntax specifications.
Thanks, I found it:
Root Cause
Two issues in haskell.xml:
-
Line 476 (code context): The operator regex doesn't properly handle backslash's special role in Haskell operators. Backslash operators are not being matched atomically.
-
Lines 542-549 (import context): The import context completely lacks operator matching rules, so operators in import lists like
import Data.List (\\)are not properly tokenized at all.
Fix
A fix has been submitted upstream to the KDE syntax-highlighting repository:
- Bug report: https://bugs.kde.org/show_bug.cgi?id=512318
- Merge request: https://invent.kde.org/frameworks/syntax-highlighting/-/merge_requests/760
The fix adds:
- Explicit backslash operator regex in code context (before general operator pattern)
- Operator matching rules (both backslash and general) to import context
- Test cases for lambda expressions and backslash operators
- Bumped syntax version from 21 to 22
Once merged upstream and released, Skylighting will pull the updated definitions in a future release.
Update: Fix submitted upstream and CI passes ✅
- Upstream bug: https://bugs.kde.org/show_bug.cgi?id=512318
- Merge request: https://invent.kde.org/frameworks/syntax-highlighting/-/merge_requests/760
- CI status: All 9 jobs passed
The fix is ready for review by KDE maintainers.