qiskit-qasm3-import

Replace this package with complete parser infrastructure

Open jakelishman opened this issue 1 year ago • 0 comments

This package was originally a proof-of-concept importer for OpenQASM 3 programs into Qiskit circuits, written in a couple of days, reusing the reference OpenQASM 3 parser and supplying only the minimal layer necessary to convert to Qiskit.

Motivation

Our intention is now that Qiskit will treat OpenQASM 3 as a first-class citizen in data input. This involves writing complete parsing and transpilation infrastructure, rather than bolting things on quickly on top of an ANTLR representation of the grammar that was principally meant to be illustrative to humans.

Problems with the current system (as implemented by this package):

  • the Python-space ANTLR-generated parser is slow, and would involve complex machinery to speed up - we'd have to use some compiled version of the ANTLR runtime and write wrapping interfaces to that in Python.
  • the reference ANTLR grammar is designed to be human readable, and so is not suited for emitting diagnostics from bad inputs, nor attempting recovery from them.
  • ANTLR cannot generate semantic typing information, which we need. This package currently evaluates types on the fly, but this is not a sustainable architecture. We need a new, efficient typed IR that the conversion layer can consume.

Proposed new architecture

Layers from lowest-level to application-specific:

  1. a shared library that implements the lexing and parsing stages, roughly equivalent to what we get from ANTLR currently, producing an untyped AST
  2. a shared library that lowers the untyped AST to some typed IR
  3. (if necessary) a conversion layer that exposes the validated program in this typed IR to Python, in a way that it can be directly interpreted by Qiskit (or other applications)
  4. Python-space code belonging to Qiskit Terra that takes the IR from step 2 or 3 as appropriate into a QuantumCircuit.

Layers 1 and 2 may well be distributed as part of the same library. Layer 3 may not be necessary, depending on the details of the typed IR. For example, if this hypothetical IR is a linear bytecode stream, such as the one employed by Qiskit's current OpenQASM 2 conversion layer, it could easily and efficiently be exposed to Python directly, skipping the need for a third representation. However, for performance reasons, it's best to keep as much as possible in compiled languages and spend as little time as possible in Python space. This likely results in a tighter coupling between application code and the bytecode stream, and may mean that the Python-exposed stream needs further semantic analysis or processing in a compiled language, in which case a third layer for each specific application may be warranted.
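To make the distinction between the untyped AST (layer 1) and the typed IR (layer 2) concrete, here is a minimal sketch in Python. Every name in it (`IntLiteral`, `BinaryOp`, `TypedExpr`, `lower`) and the two-type toy type system are assumptions made purely for illustration. The real libraries are planned in Rust, and the typed IR has not been designed yet.

```python
from dataclasses import dataclass
from typing import Union


# Untyped AST, as layer 1 might produce it: structure only, no type information.
@dataclass(frozen=True)
class IntLiteral:
    value: int


@dataclass(frozen=True)
class FloatLiteral:
    value: float


@dataclass(frozen=True)
class BinaryOp:
    op: str
    left: "UntypedExpr"
    right: "UntypedExpr"


UntypedExpr = Union[IntLiteral, FloatLiteral, BinaryOp]


# Typed IR, as layer 2 might produce it: every node carries its resolved type.
@dataclass(frozen=True)
class TypedExpr:
    ty: str          # "int" or "float" in this toy type system
    op: str          # "literal" or the operator symbol
    operands: tuple  # the literal value, or the typed child expressions


class LoweringError(Exception):
    """Raised when the untyped AST is syntactically valid but ill-typed."""


def lower(expr: UntypedExpr) -> TypedExpr:
    """Lower an untyped expression to the typed form, checking types once."""
    if isinstance(expr, IntLiteral):
        return TypedExpr("int", "literal", (expr.value,))
    if isinstance(expr, FloatLiteral):
        return TypedExpr("float", "literal", (expr.value,))
    if isinstance(expr, BinaryOp):
        left, right = lower(expr.left), lower(expr.right)
        if expr.op == "<<" and (left.ty, right.ty) != ("int", "int"):
            raise LoweringError("'<<' requires integer operands")
        # Toy promotion rule (an assumption): mixing int and float yields float.
        ty = "float" if "float" in (left.ty, right.ty) else "int"
        return TypedExpr(ty, expr.op, (left, right))
    raise LoweringError(f"unsupported node: {expr!r}")
```

The point of the separation is that a consumer of `TypedExpr` never has to re-derive types or re-check operand validity; all of that happens once, in `lower`.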

Qiskit's current importer for OpenQASM 2 code effectively has layers 1, 3 and 4 here - the typing is computed on-the-fly during the lowering from the untyped AST to a Qiskit-specific bytecode stream, skipping layer 2. The lexer, parser and bytecode stream are all implemented in Rust (belonging to the Qiskit Terra repository). Only the bytecode interpreter lives in Python space. Given the increased complexity of OpenQASM 3, we expect the four-layer architecture to be more maintainable and performant than this on-the-fly approach.
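As a rough illustration of what a Python-space bytecode interpreter over a linear stream can look like, here is a small sketch. The opcode names, the operand layout and the `RecordedCircuit` stand-in are all hypothetical. They are not Qiskit's internal OpenQASM 2 bytecode, and a real layer 4 would build a `QuantumCircuit` rather than a list of tuples.

```python
from enum import Enum, auto


class OpCode(Enum):
    # Hypothetical opcodes; the real stream format is an implementation detail.
    DECLARE_QUBITS = auto()
    DECLARE_CLBITS = auto()
    GATE = auto()
    MEASURE = auto()


class RecordedCircuit:
    """Stand-in for a real circuit object; it only records what was requested."""

    def __init__(self):
        self.num_qubits = 0
        self.num_clbits = 0
        self.operations = []  # (name, params, qubits, clbits) tuples


def interpret(stream):
    """Walk an already-validated stream once, with no parsing or type checks."""
    circuit = RecordedCircuit()
    for opcode, operands in stream:
        if opcode is OpCode.DECLARE_QUBITS:
            circuit.num_qubits += operands[0]
        elif opcode is OpCode.DECLARE_CLBITS:
            circuit.num_clbits += operands[0]
        elif opcode is OpCode.GATE:
            name, params, qubits = operands
            circuit.operations.append((name, tuple(params), tuple(qubits), ()))
        elif opcode is OpCode.MEASURE:
            qubit, clbit = operands
            circuit.operations.append(("measure", (), (qubit,), (clbit,)))
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")
    return circuit


# A stream that the compiled layers might emit for a two-qubit Bell circuit.
bell_stream = [
    (OpCode.DECLARE_QUBITS, (2,)),
    (OpCode.DECLARE_CLBITS, (2,)),
    (OpCode.GATE, ("h", (), (0,))),
    (OpCode.GATE, ("cx", (), (0, 1))),
    (OpCode.MEASURE, (0, 0)),
    (OpCode.MEASURE, (1, 1)),
]
bell = interpret(bell_stream)
```

Because the interpreter trusts the stream, the Python side stays a single flat loop, which is the property that keeps time spent in Python space small.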

Implementation details

We plan to write layers 1 and 2 in Rust, using a lexer generator and a hand-written parser based on a hybrid recursive-descent / Pratt strategy. We chose Rust for three reasons. First, the Qiskit Terra team (and the wider IBM Quantum team) already has institutional experience in Rust, including writing parsers in it. Second, Rust compiles easily to all of Qiskit's supported platforms, unlike other common parser-host languages such as C or C++. Third, Qiskit's build and deployment procedures are already set up to compile Rust code. The Qiskit team also has significant experience with the Rust/Python interface, which supports use of the CPython stable ABI, minimising the number of wheels that need to be built and deployed.
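For readers unfamiliar with the hybrid recursive-descent / Pratt strategy, the following is a minimal sketch, written in Python for brevity rather than in the Rust planned for the real parser. Statement-level rules get one hand-written method each (recursive descent), while expressions are handled by a single loop driven by a table of binding powers (Pratt). The tiny grammar, the token format and the precedence table are assumptions for illustration only and do not reproduce OpenQASM 3's grammar or operator precedence.

```python
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+\.\d+|\d+)|([A-Za-z_]\w*)|(\*\*|[-+*/()=;]))")

# (left, right) binding powers; right > left gives left associativity.
BINDING_POWER = {"+": (10, 11), "-": (10, 11), "*": (20, 21), "/": (20, 21), "**": (31, 30)}
PREFIX_POWER = 25  # unary minus: tighter than '*', looser than '**'


def tokenize(source):
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character near offset {pos}")
        number, name, symbol = match.groups()
        if number is not None:
            tokens.append(("num", number))
        elif name is not None:
            tokens.append(("name", name))
        else:
            tokens.append(("sym", symbol))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("eof", "")

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expect(self, symbol):
        token = self.advance()
        if token != ("sym", symbol):
            raise SyntaxError(f"expected {symbol!r}, found {token[1]!r}")

    # Recursive descent: one method per statement-level grammar rule.
    def parse_assignment(self):
        kind, name = self.advance()
        if kind != "name":
            raise SyntaxError(f"expected identifier, found {name!r}")
        self.expect("=")
        value = self.parse_expr(0)
        self.expect(";")
        return ("assign", name, value)

    # Pratt: parse a prefix, then fold in infix operators by binding power.
    def parse_expr(self, min_power):
        kind, text = self.advance()
        if kind in ("num", "name"):
            left = (kind, text)
        elif (kind, text) == ("sym", "("):
            left = self.parse_expr(0)
            self.expect(")")
        elif (kind, text) == ("sym", "-"):
            left = ("neg", self.parse_expr(PREFIX_POWER))
        else:
            raise SyntaxError(f"unexpected token {text!r}")
        while True:
            kind, text = self.peek()
            if kind != "sym" or text not in BINDING_POWER:
                break
            left_power, right_power = BINDING_POWER[text]
            if left_power < min_power:
                break
            self.advance()
            left = (text, left, self.parse_expr(right_power))
        return left
```

For example, `Parser(tokenize("a = 1 + 2 * 3;")).parse_assignment()` groups the multiplication under the addition, and adding a new operator only requires a new row in `BINDING_POWER` rather than a new grammar rule.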

We propose to avoid a parser generator so that we have more control over diagnostics and error recovery, and so that it is easier to dynamically switch to other compiled-in parsers based on supplied defcalgrammar statements in given files (although we do not initially intend to support calibration grammars during the import to Qiskit, pending better support in Terra for representing these pulse-level programs).
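The kind of control over diagnostics and recovery meant here can be sketched as follows, again in Python for brevity. This is a toy example under stated assumptions: the grammar accepts only statements of the form `<gate> <qubit>;`, and `Diagnostic`, `ParseError` and `parse_program` are hypothetical names. On a bad statement the parser records a diagnostic with a source offset, skips to just past the next `;` (often called panic-mode recovery), and carries on, so that one error does not hide the rest of the file.

```python
import re
from dataclasses import dataclass

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


@dataclass(frozen=True)
class Diagnostic:
    offset: int   # character offset of the offending token
    message: str


class ParseError(Exception):
    def __init__(self, offset, message):
        super().__init__(message)
        self.diagnostic = Diagnostic(offset, message)


def tokenize(source):
    tokens = [(m.start(), m.group()) for m in TOKEN_RE.finditer(source)]
    tokens.append((len(source), ""))  # end-of-input sentinel
    return tokens


def describe(text):
    return repr(text) if text else "end of input"


def parse_statement(tokens, pos):
    """Parse `<gate> <qubit> ;` and return (statement, next position)."""
    names = []
    for role in ("gate name", "qubit name"):
        offset, text = tokens[pos]
        if IDENT_RE.fullmatch(text) is None:
            raise ParseError(offset, f"expected {role}, found {describe(text)}")
        names.append(text)
        pos += 1
    offset, text = tokens[pos]
    if text != ";":
        raise ParseError(offset, f"expected ';', found {describe(text)}")
    return (names[0], names[1]), pos + 1


def parse_program(source):
    tokens = tokenize(source)
    statements, diagnostics = [], []
    pos = 0
    while tokens[pos][1] != "":
        try:
            statement, pos = parse_statement(tokens, pos)
            statements.append(statement)
        except ParseError as error:
            diagnostics.append(error.diagnostic)
            # Recovery: skip to just past the next ';' and keep parsing.
            while tokens[pos][1] not in (";", ""):
                pos += 1
            if tokens[pos][1] == ";":
                pos += 1
    return statements, diagnostics


statements, diagnostics = parse_program("h q0; cx 1 q1; x q2;")
```

Here the malformed middle statement yields one diagnostic pointing at the `1`, while the statements on either side are still parsed. With a generated parser, both the wording of the message and the choice of synchronisation point are typically harder to customise.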

Layers 1 and 2 will be developed separately from Qiskit Terra, and published as Rust crates with (likely) no Python bindings. We expect our layers 3 and 4 to be owned within the Terra repository (in crates/qasm3 and qiskit/qasm3 respectively), and for them to be considered implementation details within Terra. If we do supply Python bindings for layers 1 and 2, it will be as a separately compiled extension module, though this is not an immediate goal.

Implementation plan

The Qiskit Terra team will first agree on the typed IR representation exposed as the ABI of layer 2. We can then have a split of resources, with one person/team writing the parser components, and another writing the Qiskit-specific components belonging in Terra.

Currently, we expect that @jakelishman will oversee the project and write much of the parser components, while @jlapeyre and/or @kevinhartman will write the Qiskit interface components.

This effort will be done by IBM team members; we are not soliciting contributions from the open-source community on this effort (yet!).

Alternatives considered

Continuing to use the ANTLR-based parser

While this supplies layer 1, and using the Python version of the ANTLR runtime is simple, the result is a very slow parser with poor diagnostics. If Qiskit is to treat OpenQASM 3 as a first-class citizen for defining data, this is not an acceptable user experience.

The reference ANTLR parser is primarily intended to be a human-readable specification of the OpenQASM 3 grammar. Efforts to change this to make a more performant / better diagnostic / better recovering parser would run counter to the main goal of the ANTLR grammar.

Using the existing IBM parser

IBM already have a functioning OpenQASM 3 parser, written in C++ and based on flex and bison, which supplies layer 1 and some of layer 2. The principal reasons for not reusing it are that it was not originally designed to be used in re-entrant methods, and that it was only required to run on platforms on the IBM server side. The choices made are completely fine for its intended uses, but the effort required to rework it to support all of Qiskit's platform requirements, and to provide completely re-usable entry points without re-initialisation of the shared library, is expected to be similar to the cost of rewriting the parsing infrastructure, while risking compromising the performance of its current uses. Reusing it would also require adding C++ components to Qiskit's build processes, which would add a very large amount of complexity and make it very difficult to provide the required platform support.

This new set of libraries is intended to complement, not replace, the existing IBM tooling.

jakelishman, Jun 12 '23 17:06