tokenizer including comments

Open bjornkihlberg opened this issue 1 year ago • 10 comments

Hello,

One of the first things I built for Chez Scheme was a formatter using pretty-print, which was quite convenient except that it removes comments, reduces rational numbers (e.g. 2/10 becomes 1/5), and turns literals such as binary literals into base-10 numbers.

Is there a way to access the reader directly to preserve these, or do I need to write my own reader to support this syntax in my formatter?

// Regards

bjornkihlberg avatar Oct 03 '22 17:10 bjornkihlberg

Is read-token what you are looking for?

soegaard avatar Oct 03 '22 17:10 soegaard

@soegaard Unfortunately it isn't. Here's a quote from the documentation on read-token:

Parsing of a Scheme datum is conceptually performed in two steps. First, the sequence of characters that form the datum are grouped into tokens, such as symbols, numbers, left parentheses, and double quotes. During this first step, whitespace and comments are discarded.

Here are some examples:

> (read-token (open-input-string "; hey"))
eof
#!eof
5
5
> (read-token (open-input-string "2/10"))
atomic
1/5
0
4
> (read-token (open-input-string "#b10"))
atomic
2
0
4

bjornkihlberg avatar Oct 03 '22 17:10 bjornkihlberg

read-token will get you there. It just takes work. You can analyze the bfp/efp (the beginning and ending file positions returned with each token) to determine whether there was something between the tokens that Chez Scheme read for you. You can then construct your own token structures to stash those values. You may even go as far as keeping the actual input text for every token.

Swish-Lint uses this technique with a little coroutine magic to make the code semi-readable. Maybe seeing an example will help: https://github.com/becls/swish-lint/blob/a8a9be2b90a0ba56657a0406629cec5af44be270/indent.ss#L134
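
As a rough illustration of the "stash the text yourself" idea (not the Swish-Lint code), here is a minimal sketch that assumes the whole input is available as a string and that read-token reports character offsets into it, as in the string-port examples earlier in this thread. The record type `tok` and the procedures `tokenize` and `untokenize` are hypothetical names made up for this example.

```scheme
;; Hypothetical token record: keeps the value read-token produced,
;; plus the original text and whatever was skipped just before it.
(define-record-type tok
  (fields type value text leading))

;; Read every token from src, pairing each with its source text and
;; the skipped text (whitespace/comments) that precedes it.
(define (tokenize src)
  (let ([ip (open-input-string src)])
    (let loop ([prev-end 0] [acc '()])
      (let-values ([(type value bfp efp) (read-token ip)])
        (let ([t (make-tok type value
                           (substring src bfp efp)
                           (substring src prev-end bfp))])
          (if (eq? type 'eof)
              (reverse (cons t acc))   ; the eof token carries any trailing gap
              (loop efp (cons t acc))))))))

;; Reassemble the original text from the stashed pieces.
(define (untokenize toks)
  (apply string-append
         (map (lambda (t) (string-append (tok-leading t) (tok-text t)))
              toks)))
```

With this layout a formatter can rewrite the `leading` parts it cares about (indentation, blank lines) while emitting each token's `text` unchanged, so literals like 2/10 or #b10 are never re-printed from their values.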

laqrix avatar Oct 03 '22 18:10 laqrix

In the example:

> (read-token (open-input-string "2/10"))
atomic
1/5
0
4

the numbers 0 and 4 give you the span of the input string from which 1/5 was read. This is enough to recover the string "2/10".
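
A small sketch of that recovery, assuming you still hold the input as a string (the helper name `first-token-text` is made up for illustration):

```scheme
;; Return the original source text of the first token in src,
;; using the start/end positions that read-token reports.
(define (first-token-text src)
  (let ([ip (open-input-string src)])
    (let-values ([(type value bfp efp) (read-token ip)])
      ;; value may be normalized (e.g. 1/5); the substring is not.
      (substring src bfp efp))))
```

For the inputs shown earlier in this thread, the positions 0 and 4 mean the substring is the literal exactly as it was written.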

soegaard avatar Oct 03 '22 18:10 soegaard

Also, if one token starts later than the previous token ended, some kind of whitespace/comment was skipped.
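
A sketch of that check for the first two tokens of a string, under the same assumption that positions are character offsets into the input (`skipped-after-first-token` is a hypothetical helper, not a Chez Scheme procedure):

```scheme
;; Read two tokens and return the text skipped between them,
;; or #f if the second token starts right where the first ended.
(define (skipped-after-first-token src)
  (let ([ip (open-input-string src)])
    (let*-values ([(type1 value1 bfp1 efp1) (read-token ip)]
                  [(type2 value2 bfp2 efp2) (read-token ip)])
      (and (> bfp2 efp1)
           (substring src efp1 bfp2)))))
```

The returned text still has to be split into whitespace and comments by your own code; read-token only tells you that something was there.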

soegaard avatar Oct 03 '22 18:10 soegaard

I'm interpreting your answers to mean that there is no access to the reader.

bjornkihlberg avatar Oct 03 '22 18:10 bjornkihlberg

Does read expose a secret argument or procedure to let you customize its behavior to read comments? No. Does Chez Scheme provide a different interface to read tokens from a stream of characters? Yes, via read-token. You may or may not have considered that read attempts to read complete expressions. If your indenter needs to process incomplete expressions, then read is likely not the interface you want. read-token will give you access to the individual tokens and you can reassemble them how you wish.
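
To make the difference concrete, here is a hedged sketch comparing the two on an incomplete expression. The helpers `read-succeeds?` and `token-types` are made-up names for this example.

```scheme
;; #t if read can parse a complete datum from src, #f if it raises.
(define (read-succeeds? src)
  (guard (e [#t #f])
    (read (open-input-string src))
    #t))

;; Collect the type of every token in src, stopping at eof.
(define (token-types src)
  (let ([ip (open-input-string src)])
    (let loop ([acc '()])
      (let-values ([(type value bfp efp) (read-token ip)])
        (if (eq? type 'eof)
            (reverse acc)
            (loop (cons type acc)))))))
```

On an input such as "(foo (bar", read signals an error because the lists are never closed, while token-types still walks through the individual tokens.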

laqrix avatar Oct 03 '22 19:10 laqrix

Maybe laesare would be closer to what you need. It has some corners where formatting is lost in a roundtrip, but I will happily accept patches to fix that, as your use case is one of the things I had in mind when I wrote it.

weinholt avatar Oct 03 '22 19:10 weinholt

The behavior you want to change is baked into the tokenizer. There is no more direct access to the tokenizer than read-token, and even if there were, you wouldn't be able to change this behavior without redefining the reader (or creating a new one).

Preserving comments is an interesting problem that is considerably more difficult than it would first appear. There was, at one point, a "comment collector" interface that worked in conjunction with get-datum/annotations on the Cisco-internal version of Chez Scheme. Kent spent some time cleaning that up and trying to work through all of the corner cases so that it could be merged into the open source version, but ultimately was never able to come up with a clean interface and handle all of the strange corner cases we could dream up. We ended up abandoning that approach and going with a dedicated reader implementation instead.

jltaylor-us avatar Oct 03 '22 20:10 jltaylor-us

@weinholt Yep, that looks like it could help me.

@jltaylor-us Thank you for the context! PS. read.ss has some interesting ideas in it.

bjornkihlberg avatar Oct 03 '22 20:10 bjornkihlberg