Partial scanner refactor

Open rocky opened this issue 7 months ago • 0 comments

To handle escape sequences better, such as ignoring them in comments, branch "revise-escape-sequence-scanning" was started as a more major refactor.

However, that has become too large and is too hard to get right. This PR splits out the non-escape-sequence portion of the changes. After this, the remaining escape-sequence changes will be added back.

As a result, this code has a few forward-looking changes that do not have any use or benefit in the current code base.

.github/workflows/mathics.yml : Adjust testing to work on the "mini-tweaks" branch of mathics-core
scanner_scanner/errors.yml: revise TranslateError classes so that they capture more information about the error that was raised. In particular, save the error name, tag, and arguments. This gives exception handlers more flexibility in how to handle errors. For example, a handler should now have enough information to call the message() routine to display the error, or to do this only in certain situations. It also makes debugging easier, since uncaught errors carry more information. Temporarily, there is TranslateErrorNew() in addition to TranslateError(). We may decide to replace all occurrences of TranslateError() with TranslateErrorNew(), but that decision and work is left for later.
mathics3-tokens.py: add error handlers for problems in tokenization.
mathics_scanner/prescanner.py: incomplete() -> get_more_input(). "incomplete" is an adjective, not a verb; function names generally should be verbs or actions. "Incomplete" is a property of a situation (e.g., reading the line of input is not done yet) as opposed to a statement about what should be or needs to be done: we need to read another line of input.
mathics_scanner/tokeniser.py: replace .format() with f-strings. Comment more extensively. self.code -> self.source_text. self.code is vague and, as someone who works with Python bytecode, often misleading to me. "code" would also fit S-expression code, Python source code, Python bytecode, some tag-like "code" that you might find in an enumeration, or something else.
Add property: Token.is_inside_box. This is not used yet here, but will be in the next refactor. The parser uses it to tell the tokenizer whether we are inside ( ... ). When we are, certain additional escape sequences, such as %, are allowed.
match -> pattern_match. GNU Emacs seems to think match from re.match is a reserved word.
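
To illustrate the error-class change described above, here is a minimal sketch of an exception that saves the error name, tag, and arguments, and a handler that uses them to decide whether to display a message. All names here (TranslateErrorSketch, format_error, tokenize_sketch, run, quiet_tags) and the "Syntax"/"sntxi" values are hypothetical stand-ins for illustration; this is not the code in this PR, and format_error is not the real message() routine.

```python
class TranslateErrorSketch(Exception):
    """Hypothetical error class that keeps structured details about the error."""

    def __init__(self, name: str, tag: str, *error_args):
        super().__init__(name, tag, *error_args)
        self.name = name              # error name (illustrative value below)
        self.tag = tag                # error tag (illustrative value below)
        self.error_args = error_args  # arguments needed to build a message


def format_error(err: TranslateErrorSketch) -> str:
    """Stand-in for a message-display routine; builds text from saved fields."""
    return f"{err.name}::{err.tag}: " + ", ".join(str(a) for a in err.error_args)


def tokenize_sketch(source_text: str) -> list:
    """Toy tokenizer: splits on whitespace, rejects an unbalanced double quote."""
    if source_text.count('"') % 2 == 1:
        raise TranslateErrorSketch("Syntax", "sntxi", source_text)
    return source_text.split()


def run(source_text: str, quiet_tags=frozenset()):
    """Handler that reports an error only when its tag is not silenced."""
    try:
        return tokenize_sketch(source_text)
    except TranslateErrorSketch as err:
        if err.tag in quiet_tags:
            return None  # handled silently in this situation
        return format_error(err)
```

Because the exception carries its own name, tag, and arguments, the handler can choose per tag whether to report, and an uncaught instance shows the same fields in its traceback.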

May 21 '25 02:05 rocky