Restrict base type of array to documented types
Previously, bit and stretch were allowed as the base type of the array type. These two types are explicitly disallowed by the specification.
This commit modifies the parser to reject these invalid base types. In the parser definition file, the base type of the array type is changed from scalarType to a new rule arrayBaseType. The latter is just a copy of scalarType with the invalid base types removed.
The type bit appeared in some parser tests as the array base type. These tests have been removed.
Two tests have been added to verify that the parser produces an error if either bit or stretch appears as a base type.
imo (and I feel like it's been discussed in meetings before) this is not something that should be done by the grammar. It's a semantics concern, just there's no semantic analysis in this repo. Adding more special cases to the grammar makes it harder to read for its intended purpose, which was to be a human readable indication of the grammar of the language.
Like @jakelishman my recollection is that there was some consensus not to check semantics in the parser. However, this could be re-discussed or we could add tools on top of the AST that check such things.
I agree that the parser should not do semantic analysis. But, it seems to me that this distinction is syntax rather than semantics.
To illustrate, I could write a parser rule to detect the type error below:
let s: String = "hi";
def f(x: Int) {
}
f(s);
But writing rules for all constructs of similar complexity would complicate the parser enormously (exponentially). On the other hand, once you have the infrastructure for semantic analysis, that is, a symbol table, etc., then detecting this error is easy. This kind of type checking is the main task of semantic analysis.
On the third hand, to detect the error addressed in this PR, a very simple parser rule is sufficient. No symbol table, or binding of identifiers, is needed.
Furthermore, if you want to keep the parser grammar simple, you can replace this,
scalarType:
BIT designator?
| INT designator?
| UINT designator?
| FLOAT designator?
| ANGLE designator?
| BOOL
| DURATION
| STRETCH
| COMPLEX (LBRACKET scalarType RBRACKET)?
;
with this,
scalarType:
(BIT
| INT
| UINT
| FLOAT
| ANGLE
| BOOL
| DURATION
| STRETCH
| COMPLEX (LBRACKET scalarType RBRACKET)?) designator?
;
This works fine, except that a couple of the tests for invalid syntax fail, so you have to remove those.
You can simplify the grammar even further by adding QUBIT to the rule scalarType. This makes the rules
quantumDeclarationStatement: qubitType Identifier SEMICOLON;
qubitType: QUBIT designator?;
as well as those that depend on them, redundant. The keyword qubit can be recognized when necessary at a later stage. However, it is syntax, and is more properly distinguished in the parser.
The type system of OpenQASM 3 is currently quite simple. In particular, the number of possible types is finite and not too large (if you don't include the lengths along dimensions of arrays, and things like that) I suspect that it is possible to encode all valid types in the parser grammar without significantly more complexity than exists now.
If this is possible, then it's reasonable to ask which parts of the syntax are omitted and why.
If the goal, or an important one, of the reference grammar is to serve as a model for writing other parsers, then I think the grammar should be as complete as possible. This is my main argument for including the rule (or equivalent solution) in this PR.
However, even if we believe that this grammar rule is part of the syntax and isn't secretly encoding semantics, and even if it is the last piece, the one that makes the rules specifying the type system complete, you could consider arguments for omitting it.
For example, how prominent is the idea that is expressed in the rule in the written language specification? Which scalar types accept designators, or which can be base types of arrays is obviously much less important in the language description than the distinction between quantum and classical types. If part of the goal of the reference parser (and the text file specifying its grammar) is to help the reader understand the language, then keeping qubits distinct from scalar types is more important than keeping width-less scalar types distinct. Furthermore, you could argue that certain details of the syntax are a distraction from the goal of aiding in understanding the language.
In summary, I believe the question of which base types are allowed for arrays is a matter of syntax, not semantics. If it were up to me, I'd find a way to express it in the grammar, since it is a reference grammar.
On the final hand, I don't think it's an open-and-shut case. There are reasonable arguments for omitting it.
I was conflicted about this one initially, but after thinking it for a while I am leaning more towards accepting the change. It always felt odd to me when reading the reference grammar that arrays would take scalarType as base types, but then the spec would tell me otherwise.
This is the way we ended up doing it in our parser as well, we have a ScalarType and an ArrayBaseType, which makes further compilation steps simpler.
I don't think this is part of the semantics of the language. I feel that the parts of the language that should be left for semantic analysis are the constructs that change their rules depending on where in the program they appear. Array base types isn't one of them, no matter where in the program an array is declared, the base types are always the same, therefore, this is syntax.
If OpenQASM had user defined types, a full generic type system, and constraints on which types can be used as parameters, I would think otherwise; but OpenQASM doesn't have user defined types, array is one of OpenQASM few types and the base types that can be used to instantiate an array are always the same everywhere in an OpenQASM program.
I agree that this can be done syntactically because of OQ3's limited type system, though I still consider this a semantic concern, even if it's one that can be checked in a context-free free grammar. Rejecting it in the ANTLR parser has the knock-on effect of producing terrible error messages for a user without significant effort that this repo doesn't do, and failing to produce an AST precludes downstream tooling from doing it too.
I would expect a fully featured parser to provide an AST the restricts the types, but I would also expect such a package to recover the parse through at least as far as constructing a parse tree, rather than failing at this step. I think making this restriction in the ANTLR grammar makes it harder to read the structure of the grammar for a human, and harder to build tooling off this minimal repo.
I don't feel so strongly as to want to block the PR entirely because I do also see the merits, but I am softly against it in the context of this repo.
It sounds like there are a couple of issues that perhaps need clarification. There is a the questions of "What is the grammar for the language?" and "What is the ANTLR grammar that we can use for code generation to drive the reference parser?".
It would be great if these two were the same, but without figuring out how to do error handling, which would be tedious and error prone in itself, I think we should draw a distinction between the two.I don't like a duplication, but a secondary grammar for ANTLR is an implementation detail of the parser and not of the language.
I like the change in this PR, but I think we should close the PR and have another which splits the ANTLR grammar from a language grammar which has this type of distinction, but also fixes bits such as the statements that are allowed in gate defs. There are a lot of things allowed in the ANTLR grammar that really don't exist in the language.
Clarifying the difference between the grammar of the language and the grammar as encoded in ANTLR does make sense. It's sort of in this issue thread, but the concerns are not clearly separated.
I mostly agree that the ANTLR grammar is an implementation detail. But it's one with an extra purpose (assuming @jakelishman is correct here)
... makes it harder to read for its intended purpose, which was to be a human readable indication of the grammar of the language
I agree that an artifact that is unambiguously the grammar of the language is desirable.