Remove strange escape sequences from Q-IDENT

Open henrikt-ma opened this issue 3 years ago • 8 comments

This is a follow-up to a discussion about character set restrictions applying to string content where it was noted that we say that Q-IDENT may contain any printable ASCII character, while in fact we also allow some sort of escape sequences that look like they may actually not represent printable characters, see: https://github.com/modelica/ModelicaSpecification/pull/3079#discussion_r796293013

This is the current definition of S-ESCAPE, used in the definition of Q-IDENT (note: LaTeXML messes up the "\\" if you look at the HTML build):

S-ESCAPE = "\'" | "\"" | "\?" | "\\"
   | "\a" | "\b" | "\f" | "\n" | "\r" | "\t" | "\v"

There are two problems I'd like to address:

  • Are the escape sequences real escape sequences (encoding for another, possibly non-printable, character), or only special two character combinations that are allowed as part of a quoted identifier?
  • The ambiguous representation of '?' and '"' in identifiers is a true mess.

To show what the latter mess is about, this is currently valid:

model QIdentMess
  Real '\"';
equation
  '"' + '\"' = 2;
end QIdentMess;
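
To make the ambiguity concrete, here is a minimal Python sketch of the current Q-IDENT rule written as a regular expression. The character class used for Q-CHAR is an assumption (printable ASCII other than the single quote and the backslash, which is slightly more permissive than the specification's Q-CHAR), and the names Q_IDENT and is_q_ident are hypothetical. Both tokens from the model above lex as valid quoted identifiers, yet they differ as character sequences, and the grammar alone does not say whether they name the same variable.

```python
import re

# Assumption: Q-CHAR is approximated as printable ASCII except ' and \.
Q_CHAR = r"[ -&(-\[\]-~]"
# S-ESCAPE as currently specified: a backslash followed by one of ' " ? \ a b f n r t v
S_ESCAPE = r"\\['\"?\\abfnrtv]"
Q_IDENT = re.compile(rf"'(?:{Q_CHAR}|{S_ESCAPE})*'")


def is_q_ident(token: str) -> bool:
    """Return True if the whole token matches the current Q-IDENT rule."""
    return Q_IDENT.fullmatch(token) is not None


plain = "'\"'"      # the token '"'
escaped = "'\\\"'"  # the token '\"'

print(is_q_ident(plain), is_q_ident(escaped))  # both are valid tokens
print(plain == escaped)                        # but they are different character sequences
```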

I suggest we straighten this out as follows:

  • Don't allow S-ESCAPE in Q-IDENT.
  • Instead, only allow the two escape sequences that allow a single quote to be included in the body of a Q-IDENT, and make them true escape sequences:
Q-IDENT = "'" { Q-CHAR | Q-ESCAPE } "'"

Q-ESCAPE = "\\" | "\'"

By making them true escape sequences I mean that, for example, '\\' is a three character quoted identifier (single quote, backslash, single quote).
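
As a rough illustration of what "true escape sequences" would mean, the following Python sketch down-quotes a Q-IDENT token under this proposal. The helper name down_quote is hypothetical, and the sketch assumes that the delimiting quotes stay part of the resulting identifier, as they do today.

```python
def down_quote(token: str) -> str:
    """Map a Q-IDENT token to the identifier it denotes (hypothetical helper).

    Assumes \\\\ and \\' are the only escape sequences and that the
    delimiting single quotes remain part of the identifier.
    """
    if len(token) < 2 or token[0] != "'" or token[-1] != "'":
        raise ValueError("not delimited by single quotes")
    body = token[1:-1]
    out = ["'"]
    i = 0
    while i < len(body):
        if body[i] == "\\":
            # A backslash must be followed by another backslash or a single quote.
            if i + 1 == len(body) or body[i + 1] not in "\\'":
                raise ValueError("invalid escape sequence")
            out.append(body[i + 1])
            i += 2
        elif body[i] == "'":
            raise ValueError("unescaped single quote in body")
        else:
            out.append(body[i])
            i += 1
    out.append("'")
    return "".join(out)


token = "'\\\\'"                 # the four-character token '\\'
identifier = down_quote(token)   # quote, backslash, quote
print(len(token), len(identifier))
```

The token has four characters while the identifier it denotes has three, which is the gap between lexical form and actual name that this proposal introduces.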

henrikt-ma avatar Feb 01 '22 06:02 henrikt-ma

I apologize for any confusion, but I am now going to argue against what I suggested above. Letting the Q-ESCAPE sequences be true escape sequences is a bad idea when combined with the delimiting Q-IDENT quotes being part of the identifier.

Rationale

The problem is that delimiting quotes generally serve as valuable hints about the level of up-quoting. For comparison, let's consider Modelica strings.

If I say that the string is foo\"bar (intentionally not using inline code markup to leave room for interpretation), then you can immediately tell that this is not the up-quoted form of a string. This must be the 8 character string foo\"bar, corresponding to the Modelica string literal "foo\\\"bar".

If, on the other hand, I say that the string is "foo\"bar", then the surrounding quotes give you a good reason to interpret this as the up-quoted form of the 7 character string foo"bar, corresponding to the Modelica string literal "foo\"bar". Of course, there's a risk that it isn't the up-quoted form of the string, so that it really corresponds to "\"foo\\\"bar\"", but it's unlikely.

The point is that when you see delimiting quotes (and down-quoting is well defined), you have good reason to think that this is a string expressed in the form of a Modelica string literal. When there are no delimiting quotes, you know you're looking at the actual string content. For Q-IDENT, the problem is that the delimiting quotes can't be used to tell the Modelica lexical token apart from the identifier it represents (corresponding to the down-quoted value of a string literal).

Say I give you the identifier '\\' (again avoiding inline code markup to leave room for interpretation). Are you now looking at the display of the actual variable name, or a Modelica lexical token representing that name? The former is a 4 character quoted identifier (quote, 2 backslashes, quote), while the latter is a 3 character identifier (quote, 1 backslash, quote). This ambiguity is the result of surrounding delimiters being part of the actual identifier, and true escape sequence semantics.

As I don't think we're going to change the rule about delimiting quotes being part of the actual identifier, the only option that remains to fix this is to not interpret Q-ESCAPE as true escape sequences. This way, when I give you the identifier '\\', you don't have to guess whether you're looking at the Modelica lexical token or the actual identifier, because they are both the same 4 character quoted identifier (quote, 2 backslashes, quote).

But now, if \' is not a true escape sequence in Q-ESCAPE, what makes it so important to have this fake escape sequence at all? If there wasn't a need to allow the ' inside a Q-IDENT we could get rid of the fake escape sequences altogether, greatly simplifying matters. Further, removing the special role of \ would make it easier for tools that want to encode general string data in a Q-IDENT, as the tool can assign its own specific role to \ without ending up with the horrible double layers of backslash escaping all over the place (anyone having written regular expressions or LaTeX code in the form of plain C string literals will know what I mean).

Proposed design

Since I don't see much value in allowing the sequence \' without seeing it as an encoding of just a quote inside a Q-IDENT, I suggest that we simply remove the possibility of having ' inside Q-IDENT altogether, and remove the special meaning of \.

All old Q-IDENT identifiers will thus remain valid as long as they don't contain the sequence \', and there will be no more confusion regarding the meaning of a quoted identifier such as '\a' – this is simply a 4 character quoted identifier, as there is nothing special about the backslash.

To allow for a smoother transition, it might be desirable to start with a deprecation period when \' is still allowed inside Q-IDENT. During the deprecation period we'd have the following:

Q-IDENT = "'" { Q-CHAR | Q-ESCAPE } "'"

Q-ESCAPE = "\" ( Q-CHAR | "\" | "'" )

That is, during the deprecation period:

  • Tools must give a deprecation warning whenever the Q-ESCAPE matches the '.
  • '\'' is a valid quoted identifier that shall trigger the deprecation warning.
  • '\' is not a complete quoted identifier, possibly munging the rest of the line when looking for a terminating delimiting quote.
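
The deprecation-period rules above could be sketched in Python roughly as follows. This is only an illustration: the Q-CHAR character class is an assumption (printable ASCII other than ' and \), and lex_q_ident is a hypothetical helper that returns the token together with a flag saying whether a deprecation warning would be due.

```python
import re

# Assumption: Q-CHAR is approximated as printable ASCII except ' and \.
Q_CHAR = r"[ -&(-\[\]-~]"
# Q-ESCAPE = "\" ( Q-CHAR | "\" | "'" )
Q_ESCAPE = rf"\\(?:{Q_CHAR}|\\|')"
Q_IDENT = re.compile(rf"'(?:{Q_CHAR}|{Q_ESCAPE})*'")


def lex_q_ident(text: str, pos: int = 0):
    """Return (token, deprecated) for a Q-IDENT starting at pos, else None."""
    m = Q_IDENT.match(text, pos)
    if m is None:
        return None
    token = m.group(0)
    body = token[1:-1]
    deprecated = False
    i = 0
    while i < len(body):
        if body[i] == "\\":
            # Backslashes always start a two-character Q-ESCAPE in a matched token.
            if body[i + 1] == "'":
                deprecated = True
            i += 2
        else:
            i += 1
    return token, deprecated


print(lex_q_ident("'\\''"))      # valid token, flagged as deprecated
print(lex_q_ident("'\\' = 1;"))  # no terminating quote found: not a Q-IDENT
```

Scanning the body pairwise matters here: in the token '\\' the second backslash is directly followed by the closing quote, but that is not a use of \'.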

After the deprecation period we'd switch to this:

Q-IDENT = "'" { Q-CHAR | "\" } "'"

That is, after the deprecation period:

  • '\'' is a valid quoted identifier immediately followed by the start of another quoted identifier, possibly munging the rest of the line when looking for a terminating delimiting quote of the second identifier.
  • '\' is a valid quoted identifier.
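
For comparison, a minimal Python sketch of the rule after the deprecation period, where the backslash is just another character. The character class is again an assumption (printable ASCII other than the single quote), and lex_q_ident is a hypothetical helper.

```python
import re

# Assumption: Q-CHAR | "\" is approximated as printable ASCII except '.
Q_IDENT = re.compile(r"'[ -&(-~]*'")


def lex_q_ident(text: str, pos: int = 0):
    """Return the Q-IDENT token starting at pos, or None if there is none."""
    m = Q_IDENT.match(text, pos)
    return m.group(0) if m else None


text = "'\\'' + 'x'"            # the source text '\'' + 'x'
first = lex_q_ident(text)       # the three-character identifier '\'
second = lex_q_ident(text, len(first))
print(first)
print(second)                   # the stray quote swallows text up to the next quote
```

The second call shows the "munging" effect: the quote left over after '\' pairs up with the opening quote of the next identifier.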

henrikt-ma avatar Feb 01 '22 22:02 henrikt-ma

I would prefer to just remove the odd control-variants and keep \\, \', \", and \? as the only allowed quotes in quoted identifiers, but with the current semantics. The reason is that it allows quoting any name using printable characters in a predictable way and it uses the same syntax as for string quoting, which is similar to the string quoting in many other languages.

Two issues are slightly less clear. Do we really need \? at all? (It is needed in C due to trigraphs, but ....) And do we need \" (the same as ") in quoted identifiers? It simplifies the mental model - but is it really needed?

HansOlsson avatar Feb 02 '22 10:02 HansOlsson

Postpone again

HansOlsson avatar Jun 14 '22 15:06 HansOlsson

@christoff-buerger requested a solution that makes it possible to define a reversible name mangling scheme. If \ were a normal character and ' were not allowed inside quoted identifiers, one could define a mangling scheme with \ as escape character:

  • \\ represents \ (backslash)
  • \, represents ' (single tick)

I think an interesting use case is how to upquote and downquote Modelica component references, such as a.'b\c'[1]. With the scheme above, the mangled component reference would become 'a.\,b\\c\,[1]'.

This is only one example of several mangling schemes that can easily be defined when backslash is a normal character in a Modelica identifier, and I don't think that the Modelica specification should go into the details of which scheme to use.
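
As one possible illustration of such a scheme, the two rules above (\\ for a backslash, \, for a single tick) can be sketched in Python as a reversible pair of functions. The names mangle and demangle are hypothetical, and the sketch assumes the input contains only characters that are otherwise allowed inside a quoted identifier.

```python
def mangle(reference: str) -> str:
    """Encode a component reference as a single quoted identifier."""
    # Escape backslashes first, so the backslashes introduced for ticks stay intact.
    body = reference.replace("\\", "\\\\").replace("'", "\\,")
    return "'" + body + "'"


def demangle(identifier: str) -> str:
    """Inverse of mangle."""
    if len(identifier) < 2 or identifier[0] != "'" or identifier[-1] != "'":
        raise ValueError("not a quoted identifier")
    body = identifier[1:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            if i + 1 == len(body):
                raise ValueError("dangling backslash")
            nxt = body[i + 1]
            if nxt == "\\":
                out.append("\\")
            elif nxt == ",":
                out.append("'")
            else:
                raise ValueError("unknown escape")
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


reference = "a.'b\\c'[1]"        # the component reference a.'b\c'[1]
mangled = mangle(reference)      # 'a.\,b\\c\,[1]'
print(mangled)
print(demangle(mangled) == reference)
```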

henrikt-ma avatar Jun 14 '22 15:06 henrikt-ma

But if backslash as currently defined isn't a normal character we can view it as \\ representing \ and \' representing ' leading to a.'b\c'[1] being quoted as 'a.\'b\\c\'[1]'.

So, in one sense it is a matter of mapping a.'b\c'[1] to 'a.\,b\\c\,[1]' or 'a.\'b\\c\'[1]'.

However, one already exists, while the other will require a change to the semantics. Both can lead to really long-winded results, and to me the current variant is better, since it is similar to how quoting works in other cases, so someone just looking at it will quickly get the result. I simply don't see any advantage in changing something that works, and I don't want to spend more time on this.

HansOlsson avatar Jun 15 '22 07:06 HansOlsson

I see two objectives, where a possible compromise might be to sacrifice one for the other:

  1. Simplicity of identifiers (escape mechanisms add unwanted complexity and make it more complicated to build custom mangling schemes)
  2. Unambiguous meaning of the displayed form of quoted identifiers (an escape mechanism interpretation would typically mean that '\\\\' should be displayed as '\\')

One alternative design that would largely address 2 by sacrificing 1 would be to take the design suggested above for the deprecation period, but rebrand things to further reduce confusion:

Q-IDENT = "'" { Q-CHAR | Q-PAIR } "'"

Q-PAIR = "\" ( Q-CHAR | "\" | "'" )

With this design, it becomes clearer that there are no special escape sequences inside Q-IDENT; there are only two-character pairs whose first character must be a \.

Edit: (The only problem with this design from my point of view is the presence of "'" in Q-PAIR; without it, one could simply make "\" another alternative in Q-CHAR. This is what I mean by the proposal in this comment being a compromise that sacrifices 1 in order to get most of 2.)

henrikt-ma avatar Jun 15 '22 09:06 henrikt-ma

This almost seems right to me, but I would have Q-PAIR = "\" ( "\" | "'" ), or alternatively Q-PAIR = "\\" | "\'".

HansOlsson avatar Jul 01 '22 13:07 HansOlsson

I'm afraid you missed the point of the proposal. The point of having Q-PAIR = "\" ( Q-CHAR | "\" | "'" ) is that it removes the sensation of having escape sequences in Q-IDENT by not allowing just a few selected characters after the backslash.
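
The difference between the two Q-PAIR variants discussed here can be seen in a small Python sketch. The Q-CHAR character class is an assumption (printable ASCII other than ' and \), and the names general and restricted are hypothetical labels for the two grammars.

```python
import re

# Assumption: Q-CHAR is approximated as printable ASCII except ' and \.
Q_CHAR = r"[ -&(-\[\]-~]"


def make_q_ident(q_pair: str) -> re.Pattern:
    """Build Q-IDENT = "'" { Q-CHAR | Q-PAIR } "'" for a given Q-PAIR pattern."""
    return re.compile(rf"'(?:{Q_CHAR}|{q_pair})*'")


# Q-PAIR = "\" ( Q-CHAR | "\" | "'" ): any character may follow the backslash.
general = make_q_ident(rf"\\(?:{Q_CHAR}|\\|')")
# Q-PAIR = "\\" | "\'": only a backslash or a single quote may follow.
restricted = make_q_ident(r"\\[\\']")

for token in ["'\\\\'", "'\\''", "'\\a'"]:
    print(token,
          general.fullmatch(token) is not None,
          restricted.fullmatch(token) is not None)
```

Under these assumptions, both variants accept '\\' and '\'', but only the general one accepts '\a', which is why the restricted form still reads like a short list of escape sequences.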

henrikt-ma avatar Jul 02 '22 23:07 henrikt-ma