Support Chinese punctuation like full-width parentheses

Open josdejong opened this issue 1 month ago • 5 comments

I got an email from a Chinese user explaining that he has to switch his input method from Chinese to English in order to type parentheses in an expression. It would be really convenient for Chinese users if they could use the Chinese full-width variants of characters like parentheses.

This Wikipedia page gives a useful overview: https://en.wikipedia.org/wiki/Chinese_punctuation#Marks_similar_to_European_punctuation. The characters that are relevant for mathjs are:

| Chinese full-width character | Chinese full-width code | English character | Character name |
|---|---|---|---|
| ， | U+FF0C | , | Comma |
| 、 | U+3001 | , | Enumeration comma |
| ！ | U+FF01 | ! | Exclamation mark |
| ？ | U+FF1F | ? | Question mark |
| ； | U+FF1B | ; | Semicolon |
| ： | U+FF1A | : | Colon |
| （ | U+FF08 | ( | Left parenthesis |
| ） | U+FF09 | ) | Right parenthesis |
| ［ | U+FF3B | [ | Left square bracket |
| ］ | U+FF3D | ] | Right square bracket |

If there is indeed need for this, we can implement support for it in two ways:

  1. Simple way: in the parse function, use a regular expression to replace all full-width Chinese characters with their English variants. Or we could just provide two helper functions to replace Chinese full-width characters with their English variants and vice versa.
  2. More work: similar to functions like parse.isDigit, introduce functions like isParenthesisOpen which test for both the English and the Chinese variant of the character. We can also think through how to make sure that stringifying an expression keeps track of the originally used character. I'm not sure if this is worth the added complexity though.
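
To make the simple way concrete, here is a minimal sketch of such a helper. The names `FULL_WIDTH_TO_ASCII`, `FULL_WIDTH_PATTERN` and `normalizePunctuation` are hypothetical and not part of the mathjs API; the mapping just follows the table above.

```javascript
// Hypothetical helper, not part of the mathjs API: maps each full-width
// character from the table above to its ASCII counterpart.
const FULL_WIDTH_TO_ASCII = {
  '\uFF0C': ',', // full-width comma
  '\u3001': ',', // enumeration comma
  '\uFF01': '!', // exclamation mark
  '\uFF1F': '?', // question mark
  '\uFF1B': ';', // semicolon
  '\uFF1A': ':', // colon
  '\uFF08': '(', // left parenthesis
  '\uFF09': ')', // right parenthesis
  '\uFF3B': '[', // left square bracket
  '\uFF3D': ']' // right square bracket
}

// Character class matching exactly the keys of the mapping above.
const FULL_WIDTH_PATTERN = /[\uFF0C\u3001\uFF01\uFF1F\uFF1B\uFF1A\uFF08\uFF09\uFF3B\uFF3D]/g

// Replace every full-width punctuation character with its ASCII variant.
// All other characters are returned unchanged.
function normalizePunctuation (expr) {
  return expr.replace(FULL_WIDTH_PATTERN, ch => FULL_WIDTH_TO_ASCII[ch])
}
```

Calling `normalizePunctuation` on an expression before handing it to the parser would then be enough for the simple way. One caveat of this sketch: it replaces characters everywhere, including inside string literals in the expression, which is one reason handling this in the tokenizer may be cleaner.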

Any thoughts?

josdejong avatar Nov 12 '25 09:11 josdejong

  • This item is definitely in the spirit of #3365, so I've added it to the list there.
  • This support should definitely happen at tokenization time.
  • It will definitely be much easier to achieve post (the update of) #3423, which has been upped in priority anyway, so let's wait until that is in, at least to the v16 branch, and then assess in more detail.
  • I think there is a third alternative, which is just to have a noticeably finer-grained collection of token types, including things like COLON, PAR_OPEN, PAR_CLOSE, BRACKET_OPEN, BRACKET_CLOSE (not proposing that exact naming scheme) rather than just the one DELIMITER token type covering all these cases. That will actually make the parsing code look a bit nicer, I think -- instead of testing if the token type is DELIMITER and then checking the text of the token for specific characters, it will just check if the token type is PAR_OPEN, for example. I think that's a plus -- it would be nice if there is very limited checking of specific individual characters at parsing time, confining that mostly to tokenization. I think (the redo of) #3423 will make this sort of tokenization change pretty straightforward.
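
A rough sketch of what that third alternative could look like, limited to the punctuation step of tokenization. The `PUNCTUATION_TOKENS` map and `tokenizePunctuation` function are hypothetical and do not reflect the current mathjs parser internals; the token type names are the placeholder ones from the bullet above.

```javascript
// Hypothetical mapping from a character to a fine-grained token type.
// ASCII and full-width variants share the same type, so parsing code
// never needs to look at the character itself.
const PUNCTUATION_TOKENS = new Map([
  ['(', 'PAR_OPEN'], ['\uFF08', 'PAR_OPEN'],
  [')', 'PAR_CLOSE'], ['\uFF09', 'PAR_CLOSE'],
  ['[', 'BRACKET_OPEN'], ['\uFF3B', 'BRACKET_OPEN'],
  [']', 'BRACKET_CLOSE'], ['\uFF3D', 'BRACKET_CLOSE'],
  [':', 'COLON'], ['\uFF1A', 'COLON']
])

// Collect the punctuation tokens of an expression. Characters that are not
// in the map (digits, names, operators, whitespace) are skipped here; a real
// tokenizer would emit their own token types.
function tokenizePunctuation (expr) {
  const tokens = []
  for (let index = 0; index < expr.length; index++) {
    const type = PUNCTUATION_TOKENS.get(expr[index])
    if (type !== undefined) {
      // Keep the original character and position alongside the type.
      tokens.push({ type, text: expr[index], index })
    }
  }
  return tokens
}
```

With this shape, parsing code would test `token.type === 'PAR_OPEN'` instead of checking for a DELIMITER token and then comparing its text against `'('`. Because each token keeps its `text` and `index`, the character originally typed stays recoverable if that is ever wanted.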

gwhitney avatar Nov 12 '25 15:11 gwhitney

👍

You're right, we definitely need some refactoring in the parser to make supporting Unicode characters easier.

josdejong avatar Nov 12 '25 16:11 josdejong

Also more specifically to this issue, reading the Wikipedia page would suggest that the only Chinese punctuation character we should use as equivalent to the comma in mathjs would be the "Enumeration comma" 、(U+3001), since the only use of comma in mathjs is to separate the items in lists (function arguments, matrix entries, etc.). But if you could verify that point with your Chinese correspondent it would be helpful.

gwhitney avatar Nov 12 '25 16:11 gwhitney

You raise an important point: should converting a Node to string restore the original Unicode string? I actually think not; I think it is OK for it to produce the ASCII equivalents, because Node-to-string already canonicalizes the expression in other ways (squeezing out excess whitespace, normalizing parenthesization, etc.). But in any case, this question is primarily relevant to #3557: if the original location is preserved, presumably it's also easy to obtain the original characters used, as needed/desired.

gwhitney avatar Nov 12 '25 16:11 gwhitney

> You raise an important point: should converting a Node to string restore the original Unicode string? I actually think not

Agree

josdejong avatar Nov 19 '25 15:11 josdejong