reference icon indicating copy to clipboard operation
reference copied to clipboard

Note that raw string literals are context-sensitive

Open Gankra opened this issue 2 years ago • 10 comments

This was originally discussed with a full formal proof in

https://github.com/rust-lang/rust/blob/5187be620c76a313a19b9b596e1bce3a80a345dd/src/grammar/raw-string-literal-ambiguity.md

but that file was removed as part of the "legacy" grammar of Rust. Over the years the file has been linked many times and is genuinely non-obvious and useful to know (because it means any attempt to express Rust in entirely EBNF is futile). I don't think the full formal proof is actually necessary, we can just appeal to an example and mention the "context-free languages can't compare 3 numbers" rule. To minimize confusion, I'm also including a note that the practical implications are relatively minor for an actual practical parser.

(Full disclosure I am the original author of the proof and it was my first contribution to Rust, so I am 100% sentimental about it, but also, it is just genuinely useful information.)

Gankra avatar Mar 25 '22 15:03 Gankra

A while ago I went looking for that exact document and couldn't find it. Maybe you could link to the proof in PR changes too?

bjorn3 avatar Mar 25 '22 15:03 bjorn3

Sorry which link do you want where?

Gankra avatar Mar 25 '22 16:03 Gankra

https://github.com/rust-lang/rust/blob/5187be620c76a313a19b9b596e1bce3a80a345dd/src/grammar/raw-string-literal-ambiguity.md

bjorn3 avatar Mar 25 '22 16:03 bjorn3

Done.

Gankra avatar Mar 25 '22 16:03 Gankra

(sorry doing this via github UI, feel free to squash it all)

Gankra avatar Mar 25 '22 16:03 Gankra

I'm curious, is this true even if you consider the number of # is limited? The discussions I've seen of this (and the linked proof) seem to assume that there is no limit. Couldn't one define a regular grammar with a fixed number of rules, one for each balanced count?

I would think that nested comment blocks (which have no limit AFAIK) would be a better example of Rust being non-regular.

ehuss avatar Apr 09 '22 19:04 ehuss

This is so irrelevant in practice that I'm not even sure it's something that needs to be highlighted in the docs.

The separation into lexer and parser in Rust is pretty strong, due to macros for example, so details like this do not even belong to the grammar, if we assume that grammar starts from the parser.

petrochenkov avatar Apr 09 '22 19:04 petrochenkov

It's an explicit problem for anyone who tries to implement a parser and tries to use tools "for" context-free-grammars, which is like, the standard (bad) tooling everyone recommends.

I don't see why comments would be problematic, they're just balanced parentheses, no?

Gankra avatar Apr 09 '22 19:04 Gankra

This is so irrelevant in practice that I'm not even sure it's something that needs to be highlighted in the docs.

It is relevant in that it proves you can't write a correct syntax highlighter using standard regexes like is common in most editors.

bjorn3 avatar Apr 09 '22 19:04 bjorn3

I'm curious, is this true even if you consider the number of # is limited? The discussions I've seen of this (and the linked proof) seem to assume that there is no limit. Couldn't one define a regular grammar with a fixed number of rules, one for each balanced count?

Yes, though it isn't documented in the reference atm (#1180), the number of # is limited. The limit was recently reduced to 255. This should make the raw strings context-free, though the EBNF wouldn't look very good; I reckon you'd need 256 rules, one for each length.

@Gankra Maybe a note about that should be added to the proof?

GrishaVar avatar Apr 25 '22 19:04 GrishaVar