megaparsec Error message pretty printing is unaware of width of a character

Error message pretty printing is unaware of width of a character

Open suhdonghwi opened this issue 4 years ago • 17 comments

errorBundlePretty is not working properly if input contains full-width character.

Character pointer (^) should be pointing '이', but it is pointing the wrong position. Full-width character should be replaced with two spaces, not one.

Sep 15 '19 04:09 suhdonghwi

Possibly related to #362.

Sep 15 '19 10:09 mrkkrp

@simonvandel Can you please provide the example as unicode text, not as an image?

Oct 26 '19 15:10 mrkkrp

@mrkkrp Did you mean to refer me? Then the example text is "123 구구 이면".

Oct 26 '19 15:10 suhdonghwi

Yes, sorry. Thanks for the example!

Oct 26 '19 15:10 mrkkrp

OK, let's see. You're saying that full width characters should be replaced with two spaces. What we have at this point is column position. Simply put, we insert as many spaces as necessary to reach the same column position. So are you of the opinion that we need to increment current column by 2 instead of 1 for full width characters? Is this something that software working with Korean characters usually does?

Oct 26 '19 17:10 mrkkrp

Yes, it is usual practice because in monospaced font, two half-width spaces and one full-width character (ex. Korean syllable) are always the same width. So I think it is safe to put two spaces when it encounters full-width character.

Oct 26 '19 17:10 suhdonghwi

What I'm more concerned about, is that it may be confusing to see position like 1:5 when you only consumed two characters.

Oct 26 '19 17:10 mrkkrp

Also need to find a good way to detect full-width characters.

Oct 26 '19 18:10 mrkkrp

I don't know how it's working inside so I can't really know what is proper solution for that problem. But maybe it is possible to make use of full-width space character? It is a single space character but full-width.

Regarding the detection of full-width characters, there is a method in ICU library to check if a character is full-width or not. But ICU library itself is pretty heavy, so you can extract the method from the library.

Oct 26 '19 18:10 suhdonghwi

But maybe it is possible to make use of full-width space character?

Alas, we do not have information which characters were full-width and which ones were normal by the time we're printing parse errors. As I mentioned, we only have column position, so we have to work with that.

Oct 26 '19 18:10 mrkkrp

I looked at this a bit today and I do not know how to check for full-width characters efficiently and without depending on a library. Even in text-icu you can get Unicode category and check if it's HalfwidthAndFullwidthForms but according to the Wikipedia's page about this Unicode block (and it sort of follows from the name of the block) it's not guaranteed that those characters will be full width, it's in fact a mix of the two.

Oct 27 '19 13:10 mrkkrp

The relevant property here is probably EastAsianWidth. The detailed description of individual property values is given in UAX #11, but the high-level summary seems to be

Attribute value	Width
`EANeutral`	narrow
`EAAmbiguous`	ambiguous
`EAHalf`	narrow
`EAFull`	wide
`EANarrow`	narrow
`EAWide`	wide
`EACount`	Not a valid value; probably an artefact of translation from C

If you don't want to depend on text-icu, all the relevant information is contained in EastAsianWidth.txt, in a fairly straightforward format.

Nov 24 '19 10:11 ivan-timokhin

Using text-icu, a fix for errorBundlePretty seems pretty easy:

Change

rpadding = replicate rpshift ' '

rpadding = [ if isWide c then '　' else ' ' | c <- take rpshift sline ]

Using

import Data.Text.ICU.Char (property, EastAsianWidth(..), EastAsianWidth_(..))
isWide c = property EastAsianWidth c `elem` [EAFull, EAWide]

I'd be happy to make a PR for that.

May 10 '20 19:05 obfusk

The question is only whether or not adding an extra dependency is worth it. In the past people were really upset about each and every extra dependency.

May 10 '20 20:05 mrkkrp

I'd prefer the extra dependency if it means better Unicode support in the error message rendering. One of the main attractions of megaparsec (for me and my team) is the high quality, effortless error messages.

May 10 '20 22:05 recursion-ninja

I get that. In that case I see 4 options:

~don't do anything~
add the text-icu dependency (easy but not ideal)
add code for EAW directly (ugly)
add an errorBundlePrettyThatHandlesCharacterWidth to the API that adds the isWide predicate as an extra parameter to errorBundlePretty (also not ideal but allows downstream users like me who don't mind depending on text-icu to handle these cases fairly easily)

I'd be happy to help with any of those :)

May 10 '20 22:05 obfusk

I wrote a few lines of python to generate the relevant code point ranges:

#!/usr/bin/python3
import itertools as IT, unicodedata as UD

def grouper(iterable, n):
  args = [iter(iterable)] * n
  return IT.zip_longest(*args, fillvalue = None)

ranges, start = [], None
for i in range(0x10FFFF):
  if UD.category(chr(i)) != "Cn" and UD.east_asian_width(chr(i)) in "FW":
    if start is None: start = i
  else:
    if start is not None:
      ranges.append((start, i-1))
      start = None

print("(0, {}) [".format(len(ranges)-1))
print("    " + ",\n    ".join(
  ", ".join( "(0x{:06x}, 0x{:06x})".format(*r) for r in g if r )
  for g in grouper(ranges, 3)
))
print("  ]")

which then results in this bit of haskell:

#!/usr/bin/runhaskell

import Data.Char (ord)
import Data.Array (Array, Ix, (!), listArray, bounds)
import System.Environment (getArgs)

isWide :: Char -> Bool
isWide c = go $ bounds wideRanges
  where
    go (lo, hi)
        | hi < lo           = False
        | a <= n && n <= b  = True
        | n < a             = go (lo, pred mid)
        | otherwise         = go (succ mid, hi)
      where
        mid     = (lo + hi) `div` 2
        (a, b)  = wideRanges ! mid
    n = ord c

wideRanges :: Array Int (Int, Int)
wideRanges = listArray (0, 118) [
    (0x001100, 0x00115f), (0x00231a, 0x00231b), (0x002329, 0x00232a),
    (0x0023e9, 0x0023ec), (0x0023f0, 0x0023f0), (0x0023f3, 0x0023f3),
    (0x0025fd, 0x0025fe), (0x002614, 0x002615), (0x002648, 0x002653),
    (0x00267f, 0x00267f), (0x002693, 0x002693), (0x0026a1, 0x0026a1),
    (0x0026aa, 0x0026ab), (0x0026bd, 0x0026be), (0x0026c4, 0x0026c5),
    (0x0026ce, 0x0026ce), (0x0026d4, 0x0026d4), (0x0026ea, 0x0026ea),
    (0x0026f2, 0x0026f3), (0x0026f5, 0x0026f5), (0x0026fa, 0x0026fa),
    (0x0026fd, 0x0026fd), (0x002705, 0x002705), (0x00270a, 0x00270b),
    (0x002728, 0x002728), (0x00274c, 0x00274c), (0x00274e, 0x00274e),
    (0x002753, 0x002755), (0x002757, 0x002757), (0x002795, 0x002797),
    (0x0027b0, 0x0027b0), (0x0027bf, 0x0027bf), (0x002b1b, 0x002b1c),
    (0x002b50, 0x002b50), (0x002b55, 0x002b55), (0x002e80, 0x002e99),
    (0x002e9b, 0x002ef3), (0x002f00, 0x002fd5), (0x002ff0, 0x002ffb),
    (0x003000, 0x00303e), (0x003041, 0x003096), (0x003099, 0x0030ff),
    (0x003105, 0x00312f), (0x003131, 0x00318e), (0x003190, 0x0031ba),
    (0x0031c0, 0x0031e3), (0x0031f0, 0x00321e), (0x003220, 0x003247),
    (0x003250, 0x004db5), (0x004e00, 0x009fef), (0x00a000, 0x00a48c),
    (0x00a490, 0x00a4c6), (0x00a960, 0x00a97c), (0x00ac00, 0x00d7a3),
    (0x00f900, 0x00fa6d), (0x00fa70, 0x00fad9), (0x00fe10, 0x00fe19),
    (0x00fe30, 0x00fe52), (0x00fe54, 0x00fe66), (0x00fe68, 0x00fe6b),
    (0x00ff01, 0x00ff60), (0x00ffe0, 0x00ffe6), (0x016fe0, 0x016fe3),
    (0x017000, 0x0187f7), (0x018800, 0x018af2), (0x01b000, 0x01b11e),
    (0x01b150, 0x01b152), (0x01b164, 0x01b167), (0x01b170, 0x01b2fb),
    (0x01f004, 0x01f004), (0x01f0cf, 0x01f0cf), (0x01f18e, 0x01f18e),
    (0x01f191, 0x01f19a), (0x01f200, 0x01f202), (0x01f210, 0x01f23b),
    (0x01f240, 0x01f248), (0x01f250, 0x01f251), (0x01f260, 0x01f265),
    (0x01f300, 0x01f320), (0x01f32d, 0x01f335), (0x01f337, 0x01f37c),
    (0x01f37e, 0x01f393), (0x01f3a0, 0x01f3ca), (0x01f3cf, 0x01f3d3),
    (0x01f3e0, 0x01f3f0), (0x01f3f4, 0x01f3f4), (0x01f3f8, 0x01f43e),
    (0x01f440, 0x01f440), (0x01f442, 0x01f4fc), (0x01f4ff, 0x01f53d),
    (0x01f54b, 0x01f54e), (0x01f550, 0x01f567), (0x01f57a, 0x01f57a),
    (0x01f595, 0x01f596), (0x01f5a4, 0x01f5a4), (0x01f5fb, 0x01f64f),
    (0x01f680, 0x01f6c5), (0x01f6cc, 0x01f6cc), (0x01f6d0, 0x01f6d2),
    (0x01f6d5, 0x01f6d5), (0x01f6eb, 0x01f6ec), (0x01f6f4, 0x01f6fa),
    (0x01f7e0, 0x01f7eb), (0x01f90d, 0x01f971), (0x01f973, 0x01f976),
    (0x01f97a, 0x01f9a2), (0x01f9a5, 0x01f9aa), (0x01f9ae, 0x01f9ca),
    (0x01f9cd, 0x01f9ff), (0x01fa70, 0x01fa73), (0x01fa78, 0x01fa7a),
    (0x01fa80, 0x01fa82), (0x01fa90, 0x01fa95), (0x020000, 0x02a6d6),
    (0x02a700, 0x02b734), (0x02b740, 0x02b81d), (0x02b820, 0x02cea1),
    (0x02ceb0, 0x02ebe0), (0x02f800, 0x02fa1d)
  ]

main :: IO ()
main = mapM_ (print . isWide) . concat =<< getArgs

Sep 07 '20 13:09 obfusk

megaparsec megaparsec copied to clipboard

Error message pretty printing is unaware of width of a character

megaparsec
megaparsec copied to clipboard