mathjs icon indicating copy to clipboard operation
mathjs copied to clipboard

Proposed parsing refactor

Open gwhitney opened this issue 9 months ago • 13 comments

Describe the concerns There are a number of outstanding parsing/evaluation issues, such as #2631, which it would be convenient to work on in conjunction with #3374 (in particular the implementation of the quotient operator to be represented also by infix // ala Python -- that needs some specialized TeX representation, and right now it is tricky to to make that TeX handling uniform between quotient(a,b) and a // b). These items generally point to unifying FunctionNode and OperatorNode -- so I am planning to go ahead with that, presuming you are still willing on that, and I was thinking of calling the unified node a CallNode. (It's easiest to use a third name to make sure all instances of each have been handled in the unification.)

Anyhow, working on the unification of FunctionNode and OperatorNode led me to realize that there is currently redundancy between the big table in src/expression/operators.js and a number of small tables in src/expression/parse.js, such as the one in the (somewhat awkwardly named) function parseMultiplyDivideModulusPercentage. In addition, although there is a mechanism for adding custom nodes to the parser, it is not particularly well documented, and it is limited in scope: basically the custom nodes can only be created by syntax that looks like a function call. Thus, for example in our color computations, we would like to make expressions like #FF80C0 be color constants; this doesn't really interfere with using the # for comments, because you just need to avoid having 6 or 8 hex digits immediately after the # for a comment, and it reads very naturally. But right now there is no way to extend the parser to handle such things.

Proposal

  • Obtain all of the parser-driving symbol/precedence information from a single unified source of truth, like the one in operators.js.
  • Expose that big table in the mathjs configuration. Design question: should it be a big complicated property, perhaps named operators, directly in the config object? Or should we have a separate exposed parseConfig object, parallel to config that is just concerned with parsing? If so, should number and numberFallback properties move to parseConfig since they really only concern parsing expressions, so that config can deal with computation and parseConfig with parsing? Note that having the table of operators in the configuration would resolve #2722, for example -- someone wanting to parse && as and would just need to insert the appropriate entry in the table for logical operators before calling parse.
  • Recommended addition to this proposal: move the tokenizer or at least tables for the tokenizer into config or parseConfiguration as well, so that syntax extensions like #FF80C0 for a color constant can be supported.
  • Recommended addition to this proposal: move each precedence level's "local parsing" function, e.g, parseAddSubtract, into the operators.js-style table, with an automatic facility to call the function at one higher level of precedence to get the subexpressions of the current precedence level; this additional refactor would facilitate run-time tweaking of the parser, easing development of parsing improvements and enabling other unforeseen syntax extensions on the part of clients of mathjs, beyond just the current custom nodes facility.
  • Mildly recommended addition to this proposal: eliminate the CustomNodes facility because once the precedence table is exposed, you can implement custom nodes by just inserting an entry in the table at the appropriate precedence that recognizes your custom syntax and returns a node of your choice (custom or not). For example, if you prefer EXPR if COND else ALTERNATE to COND ? EXPR : ALTERNATE, you could add an entry to the table just before the one for the ternary that recognizes this alternate syntax, but just returns a ConditionalNode.
  • fully document all customization opportunities that result from the refactor, including the custom nodes facility if it is kept.

I am going to embark on these refactors, and let you know how it goes; I welcome feedback in the meantime. Design question: do you consider the unification of OperatorNode and FunctionNode to CallNode a breaking change in and of itself, because mathjs does document and expose its list of valid node types? If so, would you like this refactor, presuming it is OK with you, to go in two steps, one that doesn't unify the nodes but does make the parser more DRY and configurable, followed by one that does unify the nodes that would have to go into mathjs 15?

Thanks for your thoughts on this.

gwhitney avatar Mar 16 '25 23:03 gwhitney

OK, after diving into an actual refactor, I realized/remembered that there are already quite several parsing configuration settings (many of them to do with tokenization, actually) exposed as properties of the math.parse function. Clients of the package are specifically advised to change these properties to reconfigure parsing, e.g. to allow unicode characters in identifiers.

With a number of new proposed configuration parameters in the works here and #3374, these considerations lead to the following possible organization alternatives for all of the configuration parameters, which do seem to fall into two families, those for parsing (e.g. existing parse.isAlpha, config.number, and proposed operators table) and those for computation (e.g., existing config.relTol, config.predictable, proposed config.numberApproximate).

  1. Move them all to top-level properties of the config object: parse.isAlpha => config.isAlpha, create config.operators, config. numberApproximate, etc. For backwards compatibility, parse.isAlpha could have getter/setter functions that make it an alias of config.isAlpha.
  2. Have a separate parseConfig object analogous to config to house the parsing configuration: parse.isAlpha => parseConfig.isAlpha, config.number => parseConfig.number, create config.numberApproximate, parseConfig.operators, etc. We will likely be able to provide backwards-compatibility aliases, although there is some worry about creating circular dependencies.
  3. Leave parsing configuration properties as properties of the parse function and consolidate all of them there: config.number => parse.number, create parse.operators, create config.numberApproximate, etc. We could try to provide backwards-compatibility aliases, but likely would run into circular-dependency problems between the parse function and the config object. So this would likely necessitate a breaking change.
  4. Move to a two-level hierarchy inside of existing config: put parsing properties on config.parse, and compute properties on config.compute. So parse.isAlpha => config.parse.isAlpha, config.number => config.parse.number, create config.compute.numberApproximate, config.parse.operators, etc. Existing top-level properties would remain (for now) as backwards-compatibility aliases for the two-level hierarchy, exactly as config.epsilon is now.

As the most similar to the current state of the world, I will pursue (3) for now, even though it's a bit inconsistent as far as the mechanisms for configuring parsing and computation go and I don't at the moment see how to do it without a breaking change. But if you prefer any of the other options or some other configuration scheme, please just let me know and I will be happy to switch gears.

gwhitney avatar Mar 17 '25 17:03 gwhitney

First: your refactoring ideas sounds like a good plan in general. This probably is quite some work to implement, maybe it is handy not to work on too many large refactors at a time and park this for a bit and give nanomath priority for example?

Thoughts:

  • This indeed sounds like two steps that can be implemented separately:

    • Unify FunctionNode and OperatorNode into CallNode. This is indeed a breaking change.
    • Refactor the parser to support syntax extensions, moving all config into one place etc. This may not be a breaking change.
  • We could also consider removing the tokenizer as a separate step, and just do the parsing directly. I've been working the parser of jsonquery recently: src/parse.ts. It is relatively limited and is not optimized for performance, but it does have support for custom functions and operators. This allows keeping the parser very simple and for example uses regular expressions to parse numbers and strings etc. It may give some inspiration for different approaches. I'm not sure if this would be a good approach for mathjs though :).

  • I may be interested in doing a full rewrite of the parser myself. There are also limitations like no stacktrace, and I have long time wishes like supporting loops and local variables etc. But maybe it's best to do that in nanomath where we have more freedom and can start from a clean slate.

  • I think it would be most neat to unify all parser related config in a nested object config.parse, your option (4). Indeed backward compatibility is preferrable if possible, or a clear explanatory error message otherwise.

josdejong avatar May 07 '25 10:05 josdejong

First: your refactoring ideas sounds like a good plan in general. This probably is quite some work to implement, maybe it is handy not to work on too many large refactors at a time and park this for a bit and give nanomath priority for example?

Well, in practice I have been giving nanomath priority, as you have seen. But the parser is pretty self-contained, so this does not have that far-reaching of consequences as you might think. We can play the scheduling by ear, e.g. whether unifying Function and Operator Nodes goes into v15 or not.

Thoughts:

  • This indeed sounds like two steps that can be implemented separately:

  • Unify FunctionNode and OperatorNode into CallNode. This is indeed a breaking change.

It's unfortunate that changing the internal nodes of a parse tree would be considered a breaking change. But I suppose they are a documented API. :-/

  • Refactor the parser to support syntax extensions, moving all config into one place etc. This may not be a breaking change.

Sure, I could do this in #3423, keeping it non-breaking. That's one option forward.

  • We could also consider removing the tokenizer as a separate step, and just do the parsing directly. I've been working the parser of jsonquery recently: src/parse.ts. It is relatively limited and is not optimized for performance, but it does have support for custom functions and operators. This allows keeping the parser very simple and for example uses regular expressions to parse numbers and strings etc. It may give some inspiration for different approaches. I'm not sure if this would be a good approach for mathjs though :).

That's up to you. Thinking in terms of token streams for parsing seems like a conceptual simplification to me, but your mileage may vary.

  • I may be interested in doing a full rewrite of the parser myself. There are also limitations like no stacktrace, and I have long time wishes like supporting loops and local variables etc. But maybe it's best to do that in nanomath where we have more freedom and can start from a clean slate.

OK if you'd like me to close #3423 unmerged and not pursue this further, just let me know. I will just say that if I were building a MaJE parser from scratch, I would use some well established parser package so that there is a clear (mostly) declarative description of the language, rather than manually writing a bespoke parser. I think a clear language description would be a real maintenance boon.

Anyhow I won't continue further work on #3423 until I hear back.

  • I think it would be most neat to unify all parser related config in a nested object config.parse, your option (4). Indeed backward compatibility is preferrable if possible, or a clear explanatory error message otherwise.

OK, that sounds fine. If/when you give the green light for further work on this, I will change #3423 to adopt option (4) rather than (3).

gwhitney avatar May 07 '25 16:05 gwhitney

Though a full rewrite of the parser is tempting, I think achieving the direct goals of this PR does not require it: we can very well extend the existing parser. A full rewrite is always more work than you expect, so I think it is best at this point to go ahead with the start you made in #3423.

I would love to do this in two separate steps tough. First a preparatory, non-breaking refactoring step. And next, the actual changes in config and introduction of a new CallNode. Is that ok with you?

josdejong avatar May 14 '25 15:05 josdejong

OK, had a moment to look over this. All agreed -- when I next have time to work on this code, I will update #3423 to reflect configuration parameter scheme (4) above, but keep it non-breaking by making existing parameters that need to be moved to config.parse.XXX to be (deprecated) two-way backward-comaptibility aliases for the new parameter names. Then as you say, the unification of OperatorNode and FunctionNode into CallNode can be a separate breaking PR for the next major revision at that point. (I am traveling for about 10 more days, then there will be some time getting my professional work back up to speed, just to give you an idea of the time frame so you can better schedule releases, that sort of thing.)

gwhitney avatar Jun 07 '25 23:06 gwhitney

Thanks for the update Glen, enjoy your trip :)

josdejong avatar Jun 11 '25 07:06 josdejong

Since this will be a breaking change anyway (parse.isAlpha not configurable on the library instance) along with a wholesale renaming (with backwards translations) of the config parameters, I thought I would be clear on every renaming I was planning to do, so you could review and approve, and ask if you want to retire any of the old backwards-compatible configuration parameters (maybe some have been deprecated long enough).

Existing parameter Proposed new parameter Comments
config.epsilon (no point in renaming, deprecated anyway) This has been deprecated for some time, may we instead remove completely now?
config.legacySubset config.compatibility.subset This has carried a warning about "possibly being deprecated" for two major revisions; may we officially deprecate it now?
config.randomSeed config.compute.randomSeed
config.predictable config.compute.uniformType Suggested name change (as opposed to just config.compute.predictable) to clarify that it should be just about the data type of the result, and not get overloaded for any other uses.
config.precision config.compute.BigNumber.precision Since it only applies to BigNumbers and we are planning to divide parameters by datatype anyway (per #3556), why not move it now to reduce confusion?
config.matrix config.compute.Matrix.defaultType Similar motivation, also the proposed rename allows supporting other Matrix types, should they be added, e.g. ones using JavaScript TypedArrays, etc.
config.relTol config.compute.defaultRelTol Opening up space for the proposed config.compute.number.relTol, config.compute.BigNumber.relTol, etc. (those will not be implemented in the upcoming PR for this).
config.absTol config.compute.defaultAbsTol ditto
config.number config.parse.number Clarifying this one is only to be used to control reading number input
config.numberFallback config.parse.numberFallback ditto
parse.isValidLatinOrGreek none This should not be configurable; it is not a matter of configuration what is a latin or greek letter. Those are well-defined unicode code points. I will expose an "is-function" isLatinOrGreek instead so that clients can use it in their config.parse.isAlpha functions, which is the only place it is currently used.
parse.isAlpha config.parse.isAlpha On the other hand, what constitutes an alphabetic character should be configurable (someone might want to allow Hebrew or Arabic letters, say).
parse.isValidMathSymbol config.parse.isMathSymbol Note slight shortening of name; not sure why the "Valid" part is needed -- if this function returns true on it, it is a math symbol. And this should definitely be configurable, there are lots more Unicode math symbols since this function was written.
parse.isWhitespace config.parse.isWhitespace Ditto on many more whitespace Unicode code points now.
parse.isDecimalMark config.parse.isDecimalMark Configurable to allow comma instead of dot
parse.isDigit config.parse.isDigit Indeed, some other writing systems have other digits
parse.isDigitDot config.parse.isDigitDecimal Note slight name change, the semantics are "either digit or decimal mark". Note I also plan to make its default implementation just be isDigit(c, cnext) or isDecimalMark(c, cnext) so that it will automatically respond to changes in the previous two.
new config.parse.isHexDigit That could be language-dependent, too, for all I know
new config.parse.isNumberStart Exposes the algorithm for recognizing the beginning of a number literal, so that could be customized, say if you wanted to add an explicit notation for bigint literals or allow other number bases besides decimal, binary, hex, and octal, or whatever.
new config.parse.tokens Exposes the tokenizers for each token type (the heart of this new PR). The DELIMITER parser will handle not parsing a ?. token in a?.3:.7 in a streamlined fashion

OK, so far as I can see, that's everything. I'll get in progress and in the meantime you can let me know what you think, especially about retiring config.epsilon altogether and switching config.legacySubset to being officially deprecated. Thanks!

gwhitney avatar Nov 06 '25 23:11 gwhitney

OK, having gotten into it, clearly just the configuration reorganization, without any new configuration options or any parsing changes, needs to come first as its own PR to be reviewed separately, as it will touch many files.

The thing that is currently worrying me most about this portion of the plan is renaming config.number to config.parse.number to recognize the fact that it primarily controls the reading of numerical values. However, it does control just a couple of other unrelated things in the current system. Chief among these is the type returned by numerical constants such as tau and e. Clearly, that type has nothing to do with parsing numerical constants. So that aspect should definitely not be controlled by an option named config.parse.number.

So my proposal is to go ahead and introduce config.compute.numberApproximate that we had already planned in issue 3374, and have that control the type used for these constants, since after all the constant values are necessarily just approximations. When we get farther on the ideas in #3374, it will likely control more things, but there is something for this new option to control right away -- and it will immediately relieve some of these concerns where setting the number option to bigint makes it impossible to get the high-precision BigNumber version of pi, etc. Now you could set config.parse.number to bigint but config.compute.numberApproximate to BigNumber and overcome that exact problem.

For backwards compatibility, setting deprecated config.number would set both config.parse.number and config.compute.numberApproximate.

How does that all sound?

gwhitney avatar Nov 08 '25 04:11 gwhitney

Sorry that this may get long as I dive into the config details. There is actually a similar issue with the config.matrix option. It controls what type the expression [1, 2, 3] produces, as well as what type the function math.identity(3) outputs. It would seem the former is a parsing matter, and the latter is a computation matter. So how should the new config system respond?

(A) Split the option into compute.Matrix.defaultType and parse.matrix? That would seem the most logical parallel to compute.numberApproximate, parse.number, and parse.numberFallback. But whereas there is a clear use case for setting these options to BigNumber, bigint, Fraction, respectively (in fact, what we will do in NumberScope), I can't think of a scenario in which one wants to parse literal matrices as Matrix but have functions that generate matrices return Arrays. The vice-versa seems unlikely, too, although a bit less farfetched, however: perhaps you are using external functions that don't know how to deal with Matrix objects and you want a simple way to provide them with Array arguments (as has been noted in discussions here, it's a bit tricky to make Arrays in the expression language when config.matrix is set to Matrix), but you also want internal computations to benefit from the extra capabilities of Matrix. Nevertheless, generally it seems as though one wants to compute with one matrix representation uniformly.

(B) Take the point of view that [1, 2, 3] isn't really a matrix literal, but rather it's the matrix-making operator [...] applied to the argument list 1, 2, 3. (Indeed, one could change the parser to produce Operator/FunctionNode to represent instances of [] rather than have an ArrayNode for them, with essentially no loss or complication.) From this perspective, it's completely reasonable for compute.Matrix.defaultType to determine what type the [] operator produces.

For the sake of simplicity/conservatism, in the absence of a clear use case for splitting, I will go with option (B) but am happy to switch to (A) if you prefer.

gwhitney avatar Nov 08 '25 15:11 gwhitney

Another point: there are a handful of instances where the old config.number was used to select the output of an operation that did not involve parsing, but is not approximate. For example, unaryPlus(true) was returning a one of the type config.number. That is of course an exact value. Since this is the result of computation, not parsing, it should now use a config.compute.xxx parameter. For now I am using config.compute.numberApproximate to keep the number of new options down, but the name of the option is not a good fit for these cases. Should we:

(A) Just leave these as funny uses of the config.compute.numberApproximate option?

(B) Rename the option, perhaps to config.compute.numberFallback to be parallel to config.parse.numberFallback -- the number type to use for computation when the type you have doesn't work for that purpose?, or

(C) Introduce a config.compute.numberExact in addition to config.compute.numberApproximate to cover these cases in which an exact value can be selected, like for the numeric conversion of boolean true, but the type to represent that exact value in has to be selected?

Thanks.

gwhitney avatar Nov 10 '25 10:11 gwhitney

Actually, the situation is a bit worse than outlined just above. Formerly, if one set number to Fraction, the sum of an empty matrix would be Fraction(0, 0). But now under (A), it's 0 of the compute.numberApproximate type, and that config option does not currently support Fraction. In other words, it seems as though alternative (A) cannot reproduce the former behavior, and so of course (B) can't either since it's just a naming variant of (A). So that would seem to indicate that (C) is necessary to reproduce the desired behavior. But option (C) means introducing another new parameter, which will be used in just a small handful of places (sum and prod of empty matrices, unary plus of a boolean, not sure if there is anything else).

So I am going to pause on the reorganization of the parameters until you have a chance to provide some thoughts/feedback, because I don't want to proliferate options too much, but I do think we should be able to reproduce the old behavior. So maybe there is some alternative I am not seeing,

gwhitney avatar Nov 10 '25 10:11 gwhitney

Actually, having slept on this latest concern, I have a proposal: our use case (bigint for parsing integers, fraction for parsing everything else, bignumber for sqrt(2) etc.) shows the need for at least three parameters, but it would be very good to keep it down to three parameters if possible.

So I propose to go back to top-level config.number and not rename it to config.parse.number and retain its old semantics as the default number type to try to use in most circumstances, both in parsing and in computation. But it would be supplemented by two other parameters: config.parse.numberFallback which is used (only) in parsing for numeric literals that cannot be represented by config.number; and config.compute.numberApproximate which is used when the result of a computation can't be well represented in the input type or the config.number type.

I will proceed with that plan in the draft PR until I hear other thoughts if any.

gwhitney avatar Nov 10 '25 16:11 gwhitney

OK, the entire initial config-option reorganization is now submitted as a PR to the (new) v16 branch, so that you can take a look and see what you think. I imagine you may want some adjustments, and happy to do so. But this way we can get it all worked out, and once it is merged, proceed with this parsing refactor on a solid foundation where it will be clear where the parse parameters should migrate to.

gwhitney avatar Nov 11 '25 21:11 gwhitney