
Exhaustive MySQL Parser

Open adamziel opened this issue 1 year ago • 11 comments

[!NOTE]
You're most welcome to take over this PR. I won't be able to drive it to completion before November. Do you need reliable SQLite support? You can make it happen by driving this PR to completion (and forking if needed)! Are you dreaming about running WordPress on PostgreSQL? Finishing this PR is the first step there.

Read the MySQL parser proposal for the full context on this PR.

Ships an exhaustive MySQL lexer and parser that produce a structured parse tree. This is the first step towards supporting multiple databases. It's an easier, more stable, and easier-to-maintain method than the token processing we use now. It will also dramatically improve the WordPress Playground experience – database integration is the single largest source of issues.

We don't have an AST yet, but we have a decent parse tree. That may already be sufficient – by adjusting the grammar file we can mold it into almost anything we want. If that isn't sufficient or convenient for any reason, converting the parse tree into an AST is a simple tree mapping. It could be done with a well-crafted MySQLASTRecursiveIterator and a good AI prompt.

Implementation

The three focal points of this PR are:

  • MySQLLexer.php – turns an SQL query into a stream of tokens
  • DynamicRecursiveDescentParser.php – turns a stream of tokens into a parse tree
  • run-mysql-driver.php – proof of concept of a parse-tree-based query conversion

Before diving further, check out a few parse trees this parser generated:

MySQLLexer.php

This is an AI-generated lexer I initially proposed in #153. It needs a few passes from a human to inline most methods and cover a few tokens it doesn't currently produce, but overall it seems solid.

DynamicRecursiveDescentParser.php

A simple recursive parser to transform (token stream, grammar) => parse tree. In this PR we use MySQL tokens and MySQL grammar, but the same parser could also support XML, IMAP, and many other grammars (as long as they have specific properties).

The parse_recursive() method is just 100 lines of code (excluding comments). All of the parsing rules are provided by the grammar.
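
To illustrate the shape of such a parser, here's a toy grammar-driven recursive descent parser (an illustrative sketch only; the actual DynamicRecursiveDescentParser differs in detail):

```php
<?php
// Toy grammar-driven recursive descent parser, illustrating the idea behind
// DynamicRecursiveDescentParser (the real implementation differs in detail).
// Grammar: rule name => list of alternatives; each alternative is a sequence
// of rule names (non-terminals) or literal tokens (terminals).
$grammar = [
    'query'   => [['SELECT', 'columns', 'FROM', 'name']],
    'columns' => [['name', ',', 'columns'], ['name']],
    'name'    => [['id'], ['title'], ['posts']],
];

function parse_rule(array $grammar, string $rule, array $tokens, int $pos): ?array {
    foreach ($grammar[$rule] as $alternative) {
        $children = [];
        $p = $pos;
        foreach ($alternative as $part) {
            if (isset($grammar[$part])) {
                // Non-terminal: recurse.
                $sub = parse_rule($grammar, $part, $tokens, $p);
                if ($sub === null) {
                    continue 2; // This alternative failed; try the next one.
                }
                $children[] = $sub;
                $p = $sub['pos'];
            } elseif (($tokens[$p] ?? null) === $part) {
                // Terminal: consume the token.
                $children[] = $tokens[$p];
                $p++;
            } else {
                continue 2; // Mismatch; try the next alternative.
            }
        }
        // The first matching alternative wins; no backtracking after this point.
        return ['rule' => $rule, 'children' => $children, 'pos' => $p];
    }
    return null;
}

$tree = parse_rule($grammar, 'query', ['SELECT', 'id', ',', 'title', 'FROM', 'posts'], 0);
```

All parsing knowledge lives in the `$grammar` array; the parser itself stays generic, which is what makes swapping in other grammars possible.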

run-mysql-driver.php

A quick and dirty implementation of what a MySQL parse tree ➔ SQLite database driver could look like. It easily supports WITH and UNION queries that would be really difficult to implement in the current SQLite integration plugin.

The tree transformation is an order of magnitude easier to read, expand, and maintain than the current approach. I stand by this, even though the temporary ParseTreeTools/SQLiteTokenFactory API included in this PR seems annoying and I'd like to ship something better than that. Here's a glimpse:


function translateQuery($ast, $rule_name = null) {
    if (is_token($ast)) {
        $token = $ast;
        switch ($token->type) {
            case MySQLLexer::EOF:
                return new SQLiteExpression([]);

            case MySQLLexer::IDENTIFIER:
                return SQLiteTokenFactory::identifier(
                    SQLiteTokenFactory::identifierValue($token)
                );

            default:
                return SQLiteTokenFactory::raw($token->text);
        }
    }

    switch ($rule_name) {
        case 'indexHintList':
            // SQLite doesn't support index hints. Let's skip them.
            return null;

        case 'fromClause':
            // Skip `FROM DUAL`. We only care about a singular FROM DUAL
            // statement, as `FROM mytable, DUAL` is a syntax error.
            if (
                ParseTreeTools::hasChildren($ast, MySQLLexer::DUAL_SYMBOL) &&
                !ParseTreeTools::hasChildren($ast, 'tableReferenceList')
            ) {
                return null;
            }
            break;

        case 'functionCall':
            $name = $ast[0]['pureIdentifier'][0]['IDENTIFIER'][0]->text;
            return translateFunctionCall($name, $ast[0]['udfExprList']);
    }
}

A deeper technical dive

MySQL Grammar

We use the MySQL workbench grammar converted from ANTLR4 format to a PHP array.

You can tweak the MySQLParser-reordered.ebnf file and regenerate the PHP grammar with the create_grammar.sh script. You'll need to run npm install before you do that.

The grammar conversion pipeline goes like this:

  1. g4 ➔ EBNF with grammar-converter
  2. EBNF ➔ JSON with node-ebnf. This already factors compound rules into separate rules, e.g. query ::= SELECT (ALL | DISTINCT) becomes query ::= SELECT %select_fragment0 and %select_fragment0 ::= ALL | DISTINCT.
  3. Rule expansion with a Python script: expand the *, +, and ? modifiers into separate, right-recursive rules. For example, columns ::= column (',' column)* becomes columns ::= column columns_rr and columns_rr ::= ',' column columns_rr | ε.
  4. JSON ➔ PHP with a PHP script. It replaces all string names with integers and ships an int ➔ string map to reduce the file size.
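
Step 4 can be sketched roughly like this (a toy illustration of the name-interning idea; the real conversion script differs):

```php
<?php
// Toy sketch of step 4: replace rule/token names with integers and keep
// an int -> name map (illustrative only; the real phpize script differs).
$rules = [
    'query'          => [['SELECT_SYMBOL', 'selectItemList']],
    'selectItemList' => [['IDENTIFIER']],
];

$names = [];   // name => int
$map   = [];   // int => name
$intern = function (string $name) use (&$names, &$map): int {
    if (!isset($names[$name])) {
        $names[$name] = count($map);
        $map[] = $name;
    }
    return $names[$name];
};

$compact = [];
foreach ($rules as $rule => $alternatives) {
    $compact[$intern($rule)] = array_map(
        fn (array $seq) => array_map($intern, $seq),
        $alternatives
    );
}
// $compact now contains only integers; $map recovers the names.
```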

I ignored nuances like MySQL version-specific rules and output channels for this initial exploration. I'm now confident the approach from this PR will work. We're in a good place to start thinking about incorporating these nuances. I wonder if we even have to distinguish between MySQL 5 vs 8 syntax; perhaps we could just assume version 8 or use a union of all the rules.

✅ The grammar file is large, but fine for v1

Edit: I factored the grammar manually instead of using the automated factoring algorithm, and the grammar.php file size went down to 70kb. This one is now solved. Everything until the next header is no longer relevant and I'm only leaving it here for context.

grammar.php is 1.2MB, or 100kb gzipped. This already is a "compressed" form where all rules and tokens are encoded as integers.

I see three ways to reduce the size:

  1. Explore further factorings of the grammar. Run left factoring to deduplicate any ambiguous rules, then extract AB|AC|AD into A(B|C|D), etc.
  2. Remove a large part of the grammar. We can either drop support for hand-picked concepts like CREATE PROCEDURE, or modularize the grammar and lazy-load the parts we actually need to use. For example, most of the time we won't need anything related to GRANT PRIVILEGES or DROP INDEX.
  3. Collapse some tokens into the same token. Perhaps we don't need the same granularity as the original grammar.

The speed is decent

The proposed parser can handle about 1,000 complex SELECT queries per second on a MacBook Pro. It only took a few easy optimizations to go from 50/second to 1,000/second. There are plenty of further optimization opportunities once we need more speed. We could factor the grammar in different ways, explore other types of lookahead tables, or memoize the matching results per token. However, I don't think we need to do that in the short term.
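
The per-token memoization mentioned above could be sketched like this (an illustrative packrat-style cache, not code from this PR):

```php
<?php
// Illustrative (rule, position) memoization: the packrat idea mentioned
// above. $parse is any function (rule, pos) -> result.
$memo = [];
$memoized_parse = function (string $rule, int $pos, callable $parse) use (&$memo) {
    $key = $rule . '@' . $pos;
    if (!array_key_exists($key, $memo)) {
        $memo[$key] = $parse($rule, $pos); // computed once per (rule, pos)
    }
    return $memo[$key];
};

// A counting stub to show that repeated lookups hit the cache.
$calls = 0;
$stub = function (string $rule, int $pos) use (&$calls) {
    $calls++;
    return "$rule@$pos";
};
$memoized_parse('expr', 3, $stub);
$memoized_parse('expr', 3, $stub); // served from cache; $stub runs only once
```

The trade-off is memory: one cache entry per (rule, position) pair, which is why it's listed as an optimization to reach for only when needed.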

If we spend enough time factoring the grammar, we could potentially switch to a LALR(1) parser and cut most time spent on dealing with ambiguities.

Next steps

These could be implemented either in follow-up PRs or as updates to this PR – whichever is more convenient:

  • Bring in a comprehensive MySQL queries test suite, similar to WHATWG URL test data for parsing URLs. First, just ensure the parser either returns null or any parse tree where appropriate. Then, once we have more advanced tree processing, actually assert the parser outputs the expected query structures.
  • Create a MySQLOnSQLite database driver to enable running MySQL queries on SQLite. Read this comment for more context. Use any method that's convenient for generating SQLite queries. Feel free to restructure and redo any APIs proposed in this PR. Be inspired by the idea we may build a MySQLOnPostgres driver one day, but don't actually build any abstractions upfront. Make the driver generic so it can be used without WordPress. Perhaps it could implement a PDO driver interface?
  • Port MySQL features already supported by the SQLite database integration plugin to the new MySQLOnSQLite driver. For example, SQL_CALC_FOUND_ROWS option or the INTERVAL syntax.
  • Run SQLite database integration plugin test suite on the new MySQLOnSQLite driver and ensure they pass.
  • Rewire this plugin to use the new MySQLOnSQLite driver instead of the current plumbing.

adamziel avatar Aug 17 '24 10:08 adamziel

@bgrgicak tested all plugins in the WordPress plugin directory for installation errors. The top 1000 results are published at https://github.com/bgrgicak/playground-tester/blob/main/logs/2024-09-13-09-22-17/wordpress-seo. A lot of these are about SQL queries. Just migrating to the new parser would solve many of these errors and give us a proper foundation to add support for more MySQL features. CC @JanJakes

adamziel avatar Sep 13 '24 11:09 adamziel

I spent a week looking further into this, and I'd like to summarize my progress.

Summary

  1. I wrote a script to extract MySQL queries from mysql-server/mysql-test (link). I used it to scan the mysql-test/t directory (link), which, after a basic deduplication, yields over 66,000 queries.
  2. I ran @adamziel's lexer and parser on the dataset, and without any modifications, I got about 3,900 crashes (5%) and about 6,500 parsing failures (9%). This is a wonderful score, considering that the AI-generated lexer is incomplete and buggy, and I would say that this result by itself already shows that this approach appears to be viable.
  3. After implementing missing lexer parts, fixing some lexer issues, and solving some grammar conflicts, I got the numbers down to 0 crashes and about 2,500 failures (3%). It is likely that it's only a matter of a few more fixes to get the number of parsing failures closer to 0 as well. Additionally, the query-scanning script doesn't yet handle some special constructs and SQL modes, which likely causes some failures as well.

Overall, I think @adamziel's approach is a great and viable one! 👏

Parsing mysqltest

MySQL server tests are written using The MySQL Test Framework, which uses a custom DSL called the mysqltest Language.

I implemented a simple script that attempts to skip all mysqltest-specific commands and preserves only SQL queries. The script consumes the .test files and looks for a command delimiter. It is quite simple, but it needs to handle some specifics:

  1. There is a DELIMITER <delimiter> command that dynamically changes the delimiter itself.
  2. The delimiter must be ignored inside strings.
  3. The test files can have line comments.
  4. Some queries are expected to fail, which is marked by a special error command.
  5. Queries like SET SQL_MODE = ... change the SQL mode for all subsequent queries, which can affect lexing and parsing.
  6. Some commands can have multi-line arguments that are delimited by a dynamically set EOF sequence.

The current implementation doesn't handle the last two points yet, but both of them seem pretty straightforward to implement when needed.

The extracted queries are then deduplicated for exact matches and dumped in a CSV format to preserve newlines within SQL queries.
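
A minimal sketch of the extraction loop described above (illustrative only; the real script also handles delimiters inside strings, error commands, SQL modes, and more):

```php
<?php
// Toy sketch of the mysqltest query extraction described above
// (illustrative; the real script handles many more cases).
function extract_queries(string $test_file_contents): array {
    $queries   = [];
    $delimiter = ';';
    $current   = '';
    foreach (explode("\n", $test_file_contents) as $line) {
        if (str_starts_with(ltrim($line), '#')) {
            continue; // mysqltest line comment
        }
        if (preg_match('/^DELIMITER (\S+)/i', trim($line), $m)) {
            $delimiter = $m[1]; // the delimiter can change dynamically
            continue;
        }
        $current .= $line . "\n";
        // Naive check: a real implementation must ignore delimiters
        // that appear inside string literals.
        if (str_ends_with(rtrim($current), $delimiter)) {
            $queries[] = trim(substr(rtrim($current), 0, -strlen($delimiter)));
            $current = '';
        }
    }
    return array_values(array_unique($queries)); // exact-match dedup
}
```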

In the future, we can also expand the data set to extract queries from other directories within the test suite.

Lexer issues

The AI-generated lexer is surprisingly good and well-structured, but it does have some issues and missing pieces:

  1. Over 40 lexer symbols were simply missing. Previously, @adamziel already added them to the list of symbols, and I also added their implementation.
  2. There were some small issues with parsing alphanumerical and digit-based strings. Easy to fix.
  3. Binary and HEX strings in format x'abcdef' and b'010111' were not supported.
  4. The AI also hallucinated a bit and invented some non-existent version-based conditions.
  5. The lexer wasn't able to recognize charset definitions.
  6. A method to determine whether a non-reserved keyword is a function call or an identifier was not implemented.
  7. Handling of CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP was implemented incorrectly. Fun fact — while all of these functions are synonyms of the NOW() function, they are reserved keywords and can be used without parentheses, while NOW is a non-reserved keyword and must be used with parentheses (otherwise, it would be an identifier). This is now handled correctly.

There's likely to be some more issues discovered in the lexer, but I wouldn't expect running into any blockers. I'd like to do a full manual walkthrough to ensure the lexer fully corresponds to the MySQLLexer.g4 specification.

Grammar issues

The converted and compressed grammar originally comes from the MySQLParser.g4 specification, and it's almost equivalent. Although I discovered some small issues in the original specs and the grammar conversion as well, these are not major problems:

  1. The "alterOrderList" rule had a slightly incorrect definition in the original grammar.
  2. There was a rule that was incorrectly converted from ANTLR to EBNF.
  3. The "castType" seemed to be incomplete in the original grammar.

Grammar conflicts

What can pose more significant challenges is grammar conflicts. The original ANTLR grammar is meant to be processed by the ANTLR toolkit that does a lot of heavy-lifting in terms of conflict resolution. During the grammar compilation, the toolkit can refactor and reorder some rules, or introduce lookaheads into the generated parser. In our case, we don't have such a compiler at hand, which results in some conflicts manifesting when parsing.

To very briefly explain how conflicts occur, it is worth summarizing how the parser works:

  1. The grammar defines a set of rules (non-terminals) and symbols (terminals) that can be represented in the tree structure. In the tree, rules are the internal nodes and symbols are the leaves. Each rule can expand into one or more sequences of rules or symbols (its children).
  2. The parser starts at the root of the tree and tries to match the input on the first child branch. If it doesn't match, it will try the second branch, etc. If no match is found for the current input using the chosen path, the parser won't backtrack to try other alternatives. In other words, if part of the input matches a specific subtree, but the parent rule fails to match the whole input, the parser does not reconsider previous decisions. This can lead to a parsing conflict.

This behavior is correct (backtracking can be extremely expensive). In fact, it is a description of a simple LL parser. In this scenario, the conflicts can occur in the following ways:

  1. FIRST/FIRST conflict — when different rules share the same terminal prefix.
  2. FIRST/FOLLOW conflict — when a rule can be skipped (ε), and the next input could fit either that rule or the one after it.

A special case of a FIRST/FIRST conflict is left recursion, but that was already eliminated by @adamziel. Additionally, grammars can also contain ambiguities (multiple different parse trees match the same input), but I wouldn't expect running into that in this case.
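
For illustration, FIRST/FIRST conflicts can be detected mechanically by computing the set of terminals each alternative can start with and checking for overlaps (a toy sketch that ignores ε and assumes no left recursion, which this grammar has already eliminated):

```php
<?php
// Toy FIRST/FIRST conflict detection (illustrative; ignores ε and assumes
// no left recursion).
function first_set(array $grammar, string $symbol): array {
    if (!isset($grammar[$symbol])) {
        return [$symbol]; // terminal: FIRST is the symbol itself
    }
    $first = [];
    foreach ($grammar[$symbol] as $alternative) {
        $first = array_merge($first, first_set($grammar, $alternative[0]));
    }
    return array_unique($first);
}

function first_first_conflicts(array $grammar, string $rule): array {
    $seen = $conflicts = [];
    foreach ($grammar[$rule] as $alternative) {
        foreach (first_set($grammar, $alternative[0]) as $terminal) {
            if (isset($seen[$terminal])) {
                $conflicts[] = $terminal; // two alternatives share a prefix
            }
            $seen[$terminal] = true;
        }
    }
    return $conflicts;
}

// Both alternatives of 'query' start with SELECT_SYMBOL, mirroring the
// SELECT ... INTO conflict described below (rule names are made up).
$grammar = [
    'query'           => [['selectStatement'], ['selectInto']],
    'selectStatement' => [['SELECT_SYMBOL', 'IDENTIFIER']],
    'selectInto'      => [['SELECT_SYMBOL', 'IDENTIFIER', 'INTO_SYMBOL']],
];
```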

As for the above-mentioned conflicts, I did run into some indeed:

  1. SELECT ... INTO @var was matched by a branch for a simple select (without INTO), which made the full query fail to parse. For now, I hardcoded a manual lookahead to fix this.
  2. Non-reserved keywords in MySQL can be matched as identifiers in the grammar. However, identifiers can never take precedence over these keywords. For instance, it is valid to use ROWS as an identifier (e.g., in an alias), but in some places (e.g., OVER (ROWS ...)) it must be a keyword. In these cases, it helped to reorder some grammar rules — one, two, three, where the last one also had a FIRST/FIRST conflict.

Solving the conflicts manually and reordering the rules is probably not the ideal final solution, but it can help us understand the nature and quantity of these conflicts. As a next step, it might be worth addressing the conflicts by writing a simple script to analyze the grammar, but it's good to first understand what we actually need to solve. Ultimately, we should probably do some left-factoring, use lookaheads, and maybe try to build parsing tables to see if they could have a reasonable size in some form.

Some of the tooling seems to be provided by enbfutils, but I had little luck addressing any of the conflicts using those tools. At least, I wrote a simple script to dump conflicts from the expanded JSON grammar.

Ultimately, I think when we understand the conflicts we're running into and how we want to address them, writing an algorithm to do so won't be a difficult task. I'll keep exploring a bit in this area.

Next steps

As for the next steps, I think it would make sense to focus on the following:

  1. Manual pass of the lexer and comparison with MySQLLexer.g4.
  2. Some more investigation into the reasons of the remaining parsing failures.
  3. Bringing this branch to a mergeable state to close the initial prototyping phase. The parser wouldn't be used anywhere just yet, but it would be a milestone.

What do you think?

JanJakes avatar Sep 30 '24 13:09 JanJakes

This. Is. Incredible. Such a good work here @JanJakes!

A few thoughts I had as I was reading:

Solving the conflicts manually and reordering the rules is probably not the ideal final solution, but it can help us understand the nature and quantity of these conflicts.

Manual resolution is not a general solution over all possible parsers we could generate here, but it should be fine for the MySQL parser. I can only think of two scenarios where that would be a problem:

  1. There are too many conflicts to reasonably address them manually.
  2. The MySQL syntax will keep evolving in a direction that introduces more of these conflicts, making the maintenance more difficult.

An automated script would surely help us with the latter, but perhaps that work doesn't need to happen before the initial SQLite translation layer – what do you think?

As a next step, it might be worth addressing the conflicts by writing a simple script to analyze the grammar

Another thought – once we have a generalized solution, we could generate more parsers this way. I wanted a reasonably fast Markdown parser in PHP for a while (although that's totally out of scope here :-))

I got the numbers down to 0 crashes and about 2,500 failures (3%).

This is lit 🔥

adamziel avatar Sep 30 '24 16:09 adamziel

@adamziel Well, I'm only building on your phenomenal foundations 😊

An automated script would surely help us with the latter, but perhaps that work doesn't need to happen before the initial SQLite translation layer – what do you think?

💯 I think in the first phase, we should be able to go with the manual fixes. Doing it properly and automatically could be a rabbit hole, so I would avoid going that way for now if we can. The fact that we got to a 3% error rate on an extensive test suite is promising.

Actually, it's even more promising, given the fact that there are still some issues in the lexer. I've done the first pass of semi-manually comparing the lexer to the MySQLLexer.g4 grammar, which already yielded some nice results:

  1. More correctness (some more test failures eliminated).
  2. Nice size reduction (9460 LOC to 3828 LOC) while preserving performance.

What's promising is that this is only a part of the manual pass. I sorted out all tokens, function identifiers, synonyms, and version-specifics, but there are things I haven't addressed yet, and I know some of them need to be fixed. For instance, the AI used all the ctype functions, but that doesn't correctly represent the full range of allowed MySQL identifiers.

In summary, there's a chance even more test failures will be eliminated just by improving the lexer, hopefully leaving us with fewer interventions on the grammar side.

JanJakes avatar Oct 01 '24 15:10 JanJakes

I had a conversation with @JanJakes about this and just wanted to note the idea (whether feasible or not) of transforming the MySQL AST into an SQLite AST before we transform it back into an SQL query string and pass it on to SQLite. I found, for example, this (JS) project that creates SQL from an SQLite AST.

akirk avatar Oct 02 '24 09:10 akirk

Another thing that's worth mentioning is the SHOW PARSE_TREE statement in MySQL. This command works with a debug build of MySQL, and it could be useful if we want to assert the parser's correctness in the future. We would need to do some kind of mapping to our AST, but I suppose it could work.

JanJakes avatar Oct 02 '24 11:10 JanJakes

I had a conversation with @JanJakes about this and just wanted to note the idea (whether feasible or not) of transforming the MySQL AST into an SQLite AST before we transform it back into an SQL query string and pass it on to SQLite. I found, for example, this (JS) project that creates SQL from an SQLite AST.

I'm not saying no. Perhaps the added complexity of modeling another AST would still be worth it in the end. That being said, I have a few concerns:

  • That would require two steps (AST->AST, AST->string) instead of one (AST->string).
  • Modeling SQLite AST is another challenge, we'd need to source SQLite grammar from somewhere.
  • Arrays are expensive in PHP. It may not be the bottleneck here, but more trees = more arrays for each query
  • A single MySQL query may require 10 SQLite queries and multiple roundtrips to the database
  • Would we have to follow that same approach in the future PostgreSQL driver? That would be another AST to model.

adamziel avatar Oct 02 '24 14:10 adamziel

@adamziel The idea is interesting, but you've raised many good points. It seems like it's definitely more than a simple AST transformation, and I find especially the following point to be a very important one:

A single MySQL query may require 10 SQLite queries and multiple roundtrips to the database

Maybe in the first phase, we should focus on a simple AST -> string transformation.

JanJakes avatar Oct 04 '24 13:10 JanJakes

Just a quick update:

  • I focused mainly on the lexer for now. Many fixes: sorted out all the tokens according to the grammar, fixed handling of Unicode, identifiers, numbers, nchars, different edge cases, etc. The lexer is now more than 3x smaller, and lexing time for all 66k queries went from 1.6-1.7s to under 1s on my MacBook. I would say the lexer is almost done.
  • I did a few more grammar fixes when I noticed issues during testing. I didn't focus on them primarily; there is still room for a few more grammar fixes to reduce the failure set.
  • I downgraded the test suite to MySQL version 8.0.38 since that's the version that the MySQL Workbench grammar seems to be tagged for. I wonder if this means they don't really update it that often/anymore.
  • I'll be at a meetup next week. Depending on how busy I am, I'll see if I'm able to do any work.
  • When I'm back, I think we could wrap this up and focus on the query translation layer. I think we don't need to get to 0 errors yet; we can always return to fixing the last parser cases later on. Let me know what you think!

One thing I realized is that while the lexer takes the MySQL server version into consideration, the parser doesn't. This may result in some edge cases where the lexer may provide tokens in some ways that the parser may not necessarily expect. It is something to think about in the future — maybe we can create our own ANTLR ➔ EBNF/JSON conversion script to preserve the version information.

JanJakes avatar Oct 04 '24 19:10 JanJakes

Amazing progress @JanJakes hats off!

One thing I realized is that while the lexer takes MySQL server version into consideration, the parser doesn't.

I wonder if we can overlook that for now and add support for version-specific parsing later on.

Today, our parser is a mapping: (Grammar, Query) -> ParseTree.

Version-specific parsing would be another mapping: (MySQLVersion) -> Grammar.

It seems like we could find that second mapping in the future without much risk of building a system on top of the wrong abstraction?

We could account for that in the API design by requiring a $mysql_version argument to construct the parser and throwing if it's different from the specific version we support. Perhaps a single MySQL version would get us 95% of the way there?
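
The suggested API shape could be as simple as this (hypothetical class name; only the version-check-and-throw idea comes from the suggestion above):

```php
<?php
// Hypothetical sketch of the suggested API: require a version argument and
// throw for anything but the single supported version.
class VersionCheckedParser {
    public int $mysql_version;

    public function __construct(int $mysql_version) {
        // 80038 assumes MySQL 8.0.38, the version the test suite targets.
        if ($mysql_version !== 80038) {
            throw new InvalidArgumentException(
                "Unsupported MySQL version: $mysql_version"
            );
        }
        $this->mysql_version = $mysql_version;
    }
}
```

This keeps the door open for a (MySQLVersion) -> Grammar mapping later without committing to any abstraction now.
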

What do you think? Am I missing anything?

adamziel avatar Oct 05 '24 17:10 adamziel

When I'm back, I think we could wrap this up and focus on the query translation layer. I think we don't need to get to 0 errors yet; we can always return to fixing the last parser cases later on.

I like that idea.

My only worry is building a translation layer only to discover a class of errors that cannot be fixed without a substantial re-architecture of the parser.

If we can rule that out, I'm all for focusing on the database driver. I think it is a driver, because we can't translate some queries without heavily interacting with the database; therefore, a translate() method would be the wrong abstraction. We can only implement a query() method.

I think a PDO-compliant interface would be highly useful. What do you think? It could be a drop-in replacement for many non-WordPress apps.
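
A rough sketch of that driver shape (hypothetical interface and class names; only the query()-rather-than-translate() point comes from the discussion):

```php
<?php
// Hypothetical driver shape: a query() method rather than translate(),
// since some MySQL queries need multiple SQLite statements and database
// state to answer. Names are illustrative, not a real API.
interface MySQLDriver {
    /** Execute a MySQL query against the underlying store and return rows. */
    public function query(string $mysql_query): array;
}

class FakeMySQLOnSQLite implements MySQLDriver {
    public array $executed = []; // stand-in for a real SQLite connection

    public function query(string $mysql_query): array {
        // A real driver would parse $mysql_query, translate the tree, and
        // possibly run several SQLite statements; here we just record it.
        $this->executed[] = $mysql_query;
        return [];
    }
}
```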

adamziel avatar Oct 05 '24 17:10 adamziel

Total: 66636 | Failures: 818 / 1% | Exceptions: 0 / 0%

This makes me happy

adamziel avatar Oct 16 '24 19:10 adamziel

The g4 parser — how difficult would it be to express g4 syntax using EBNF grammar? We could then generate a parser for those ANTLR files the same way we parse MySQL queries. With that PHP-native ANTLR parser, we could:

  • Have a generic AntlrParser class that accepts the ANTLR grammar as text.
  • Compile ANTLR files directly as small PHP parsers, I'd love to try parsing Markdown that way
  • Skip the intermediate JSON representation (unless it's useful for debugging)

And maybe it's too much effort and distraction. I just wonder.

adamziel avatar Oct 16 '24 22:10 adamziel

I owe an update here. The lexer is mostly done, and I've been focusing on the parser now. I dug a bit deeper to find out whether the current error rate is fine to move on to SQL translation, or if we should fix most/all queries first.

Here's the progress and some observations:

Script to parse the ANTLR grammar

As I was digging deeper into the grammar, I observed the following:

  1. It was rather hard to do more complex fixes in the converted BNF grammar, where each rule is formatted on a single line.
  2. I discovered an issue during the ANTLR -> EBNF conversion, and suspected there might be more.
  3. The ANTLR -> EBNF conversion didn't preserve the MySQL version specifiers in the original grammar.
  4. The conversion required Node.js & NPM deps, Python & PIP deps, and PHP.

Therefore, I spent about a day trying to write a simple script with some recursive regexes to parse the ANTLR grammar myself. It turned out to be doable. The resulting script is under 200 LOC and does all the necessary steps (parsing, flattening, expanding ?, +, * quantifiers into subrules, and converting to the expected format).
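
The quantifier expansion step can be sketched like this (a toy illustration of the `*` case; the actual script also handles `?` and `+` and the other conversion steps):

```php
<?php
// Toy sketch of expanding a '*' quantifier into a right-recursive subrule.
// For example,
//   columns := column (',' column)*
// becomes
//   columns    := column columns_rr
//   columns_rr := ',' column columns_rr | ε
function expand_star(string $rule, array $prefix, array $repeated): array {
    $rr = $rule . '_rr';
    return [
        $rule => [array_merge($prefix, [$rr])],
        $rr   => [array_merge($repeated, [$rr]), []], // [] is the ε branch
    ];
}

$expanded = expand_star('columns', ['column'], [',', 'column']);
```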

Additionally, it handles the above-mentioned issue with empty branches correctly, and preserves version specifiers inside the grammar. Going forward, this could also unlock new possibilities in processing the grammar — it may be easier now to try to detect/resolve conflicts in an automated way, generate lookup tables, etc.

The script is rather quick and dirty, but it does the job. In the future, we could come up with a more robust solution (like @adamziel's great idea).

Handling MySQL versions

Preserving version specifiers means that the resulting grammar now includes a version attribute for relevant rules:

{
  "name": "%selectOption30",
  "bnf": [
    [
      "MAX_STATEMENT_TIME_SYMBOL",
      "EQUAL_OPERATOR",
      "real_ulong_number"
    ]
  ],
  "versions": "serverVersion >= 50704 && serverVersion < 50708"
},

This will allow users to specify a MySQL version, and will allow us to respect that version in both the lexer and the parser.
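
For illustration, respecting the version attribute at grammar load time could look roughly like this (the function name is an assumption, and a real implementation would parse the constraint expression properly rather than regex-matching it):

```php
<?php
// Illustrative sketch: filter grammar rules by server version using the
// "versions" attribute shown above. Handles constraints of the exact shape
// "serverVersion >= N && serverVersion < M"; a real implementation would
// parse the expression instead.
function rule_applies(array $rule, int $serverVersion): bool {
    if (!isset($rule['versions'])) {
        return true; // unversioned rules always apply
    }
    preg_match_all(
        '/serverVersion\s*(>=|<=|>|<)\s*(\d+)/',
        $rule['versions'],
        $m,
        PREG_SET_ORDER
    );
    foreach ($m as [, $op, $bound]) {
        $bound = (int) $bound;
        $ok = match ($op) {
            '>=' => $serverVersion >= $bound,
            '<=' => $serverVersion <= $bound,
            '>'  => $serverVersion >  $bound,
            '<'  => $serverVersion <  $bound,
        };
        if (!$ok) {
            return false;
        }
    }
    return true;
}

$rule = [
    'name'     => '%selectOption30',
    'versions' => 'serverVersion >= 50704 && serverVersion < 50708',
];
```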

This also highlights an important advantage of using the MySQL Workbench grammar rather than the official MySQL Server one (which would be harder to extract) — the MySQL Workbench grammar defines a single grammar for multiple MySQL versions at once, while the official MySQL Server repository has a whole grammar specification per version (as tagged in the repository). For this reason, I think the ANTLR grammar from MySQL Workbench was a great choice.

Fixing the grammar

We're now approaching the last 500 test failures, and most of the improvements are now made via conflict resolutions and fixes in the grammar itself. These are the types of issues I'm encountering:

  1. Grammar conflicts — as already mentioned, we don't have any lookup tables or other automated mechanisms to address those, but it's not hard to verify and fix them manually, one by one. Often, it's about reordering the rules.
  2. Grammar errors — while the MySQL Workbench grammar covers a lot, there are some incomplete rules, some rules with small mistakes, and some parts are even missing.

Since we're so close to 0 errors already, I think continuing this approach is the way to go for now. In the future, we could provide some of the fixes to MySQL Workbench upstream and/or start to maintain the grammar ourselves.

Other fixes

While going through the query failures, I sometimes ran into invalid ones. This is because the mysqltest Language is fairly rich, but I'm only using a simple script to parse it, so it's not 100% correct.

For now, I'm fixing these issues by improving the script. In the future, we could consider parsing the tests properly. After all, we have a universal parser... you see where I'm going with this 😊 At the moment, I don't think this is a priority.

Next steps

Being so close to 0 failures, I think it makes sense to inspect the few remaining ones. After that, it's time to wrap up this branch and move on to query translation.

JanJakes avatar Oct 17 '24 15:10 JanJakes

@adamziel Sorry for being so late with some of your questions.

I think this one is already resolved now:

I wonder if we can overlook that for now and add support for version-specific parsing later on.

Today, our parser is a mapping: (Grammar, Query) -> ParseTree. Version-specific parsing would be another mapping: (MySQLVersion) -> Grammar.

It seems like we could find that second mapping in the future without much risk of building a system on top of the wrong abstraction?

We could account for that in the API design by requiring a $mysql_version argument to construct the parser and throwing if it's different than the specific version we support. Perhaps a single MySQL version would get us 95% where we need to be?

What do you think? Am I missing anything?

As mentioned in the update above, if we preserve the version specifiers from the original ANTLR grammar, then we could use them quite easily at runtime, either when loading the grammar, or when parsing the input. At the moment, I preserve the specifiers up to the JSON representation, but if we propagate them to the PHP representation, then it's straightforward to use them.

I realized this is a major advantage of using the ANTLR grammar from MySQL Workbench — the grammar can capture multiple MySQL versions at once.


The g4 parser — how difficult would it be to express g4 syntax using EBNF grammar? We could then generate a parser for those ANTLR files the same way we parse MySQL queries. With that PHP-native ANTLR parser, we could:

  • Have a generic AntlrParser class that accepts the ANTLR grammar as text.
  • Compile ANTLR files directly as small PHP parsers, I'd love to try parsing Markdown that way.
  • Skip the intermediate JSON representation (unless it's useful for debugging).

And maybe it's too much effort and distraction. I just wonder.

Amazing thinking, I love this! I'll try to avoid following that distraction immediately, but we could spend a day on it later on, it's a great idea. As for skipping the JSON representation, yes! It's not really useful, you can always dump a PHP array in any format for debugging. I even wanted to merge the phpize-grammar part with my new script, but didn't get to that yet.

JanJakes avatar Oct 17 '24 15:10 JanJakes

Brilliant work here @JanJakes ❤️

adamziel avatar Oct 21 '24 10:10 adamziel

Wrap up

I think it's time to wrap up this pull request as a first iteration of the exhaustive MySQL lexer & parser, and start looking into the next phase — the SQL query translation.

Of almost 70,000 queries, only 11 are failing (or invalid) now. There are likely to be more edge-case issues and missing version specifiers, but we can focus on improving correctness later on, in part with some automated checks.

This PR ships

  1. A MySQL lexer, adapted from the AI-generated one by @adamziel. It's over 3x smaller and close to 2x faster.
  2. A MySQL grammar written in ANTLR v4 format, adapted from the MySQL Workbench grammar by adding and fixing some cases and reordering some rules.
  3. A script to factor and convert the grammar to a PHP array. There are no 3rd party deps now, it's just a PHP script.
  4. A dynamic recursive parser implemented by @adamziel.
  5. A script to extract tests from the MySQL repository.
  6. A test suite of almost 70k queries.
  7. WIP SQLite driver by @adamziel, a demo and foundation for the next phase.

At the moment, all the new files are omitted from the plugin build, so they have no effect on production whatsoever.

Running tests

The lexer & parser test suite is not yet integrated into the CI or the existing test commands. To run the tests, use:

php tests/parser/run-lexer-tests.php
php tests/parser/run-parser-tests.php

This will lex / lex & parse all the ~70k queries.

Known issues

There are some small issues and incomplete edge cases. Here are the ones I'm currently aware of:

  1. A very special case in the lexer is not handled — while identifiers can't consist solely of numbers, in the identifier part after a ., this is possible (e.g., in 1ea10.1, both the table name and the column name look numeric). This is not handled yet, and it may be worth checking whether all cases in the identifier part after a . are handled correctly.
  2. Another very special case in the lexer — while the lexer does support version comments, such as /*!80038 ... */ and nested comments within them, a nested comment within a non-matched version is not supported (e.g., SELECT 1 /*!99999 /* */ */). Additionally, we currently support only 5-digit version specifiers (80038), but 6 digits should probably work as well (080038).
  3. Version specifiers are not propagated to the PHP grammar yet, and versions are not applied in the grammar yet (only in the lexer). This will be better to bring in together with version-specific test cases.
  4. Some rules in the grammar may not have version specifiers, or they may be incorrect.
  5. The _utf8 underscore charset should be version-dependent (only on MySQL 5), and maybe some others are too. We can check this by SHOW CHARACTER SET on different MySQL versions.
  6. The PHPized grammar now contains array indexes of the main rules, while previously they were not listed. It seems there are numeric gaps. It might be a regression caused when manually parsing the grammar. I suppose it's an easy fix.
  7. Some components need better test coverage (although the E2E 70k query test suite is pretty good for now).
  8. The tests are not run on CI yet.
  9. I'm not sure if the new code fully satisfies the plugin PHP version requirement. We need to check that — e.g., that there are no PHP 7.1 features used. Not fully sure, but I think there's no lint for PHP version in the repo, so we could add it.
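For context on item 2, here's a rough sketch of how a 5-digit version specifier in a conditional comment can be compared against the server version (illustrative only; the function name and exact handling are not the lexer's actual code):

```php
<?php
// Sketch: given the text following "/*!", decide whether the
// conditional comment's body should be executed by this server version.
function version_comment_applies( $text_after_bang, $mysql_version ) {
	// The digits immediately after "/*!" form the version specifier,
	// e.g. "80038" in "/*!80038 SELECT ... */".
	$digit_count = strspn( $text_after_bang, '0123456789' );
	if ( 0 === $digit_count ) {
		return true; // "/*! ... */" with no version always applies.
	}
	$specifier = (int) substr( $text_after_bang, 0, $digit_count );
	return $mysql_version >= $specifier;
}
```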

This list is mainly for me, in order not to forget these. I will later port it into a tracking issue with a checklist.

Previous updates

Since the thread here is pretty long, here are quick links to the previous updates:

Ready for review

I would consider this to be ready for review now. I fixed all coding styles, and excluded the tests for now. Additionally, all the new files are excluded from the plugin build at the moment (.gitattributes).

I guess you can now have a look, @adamziel, @brandonpayton, @bgrgicak, and anybody I missed. I'll try to incorporate any suggestions soon (likely on Monday, as I'm AFK tomorrow).

JanJakes avatar Oct 31 '24 13:10 JanJakes

Let's make sure the known issues are @TODOs somewhere in the code – that will act as a reminder and also make the limitations clear when working with the code.

adamziel avatar Oct 31 '24 13:10 adamziel

Some rules in the grammar may not have version specifiers, or they may be incorrect.

What does it mean that they can be incorrect?

adamziel avatar Oct 31 '24 13:10 adamziel

I'm not sure if the new code fully satisfies the plugin PHP version requirement. We need to check that — e.g., that there are no PHP 7.1 features used. Not fully sure, but I think there's no lint for PHP version in the repo, so we could add it.

I think we're good with PHP 7.2+ since WordPress bumped the minimal version recently. Also, the CI runner uses PHP 7.2+ so enabling the unit tests in CI may be enough to ensure compatibility.

adamziel avatar Oct 31 '24 13:10 adamziel

What does it mean that they can be incorrect?

Sometimes, there are little issues like this one or a few missing rules like this. The MySQL Workbench grammar is not 100% equivalent to the original MySQL grammar in some small details. There could be more, although passing the large test suite should make us confident we're >99% correct. That said, we could compare all the tiny details with the original MySQL server grammar, if needed, and explore whether some comparison could be made automatically (e.g., fetching the MySQL server grammar in all different versions, comparing some tokens, etc.).

JanJakes avatar Oct 31 '24 14:10 JanJakes

@JanJakes amazing work here, this is quite a PR! I'm super excited to have such a great parser at our disposal – this enables a reliable MySQL-on-SQLite database driver where the only limiting factor is the number of cases supported by the adapter – not our ability to parse SQL. Fan-ta-stic!

I left ample comments in all the places I think we can improve. Some of them may seem like nitpicks – my intention is to have a parser class that's virtually ready for WordPress core, with reliable Unicode processing, intuitive naming, and good inline documentation. I understand this will delay working on the actual driver. I still think spending some extra time here is worth it. All the decisions and context are still fresh, and they'll never be easier to wrap up than now.

Sometimes, there are little issues like this one or a few missing rules like this. The MySQL workbench grammar is not 100% equivalent to the original MySQL grammar in some small details. There could be more, although satisfying the large test case should make us confident we're >99% correct.

That's absolutely reasonable, thank you for explaining!

I understand we're testing for PHP errors and parsing failures. How do we know the produced parse tree is correct? Could we also have unit tests for that? If there's no easy way to do it for the entire test set, can we do it for a limited subset of the most tricky queries?

adamziel avatar Nov 04 '24 13:11 adamziel

@adamziel Thanks for the great and detailed review! Great comments and ideas! I think I've now gone through all of them: resolved most, replied to some, and documented some with TODOs. I took a lot of iterations on the lexer, while on the parser side I added a few more TODOs so we can think through and address some of the ideas in another PR, if that makes sense.

Some of the improvements are:

  • Lexer now needs no PCRE, nor the UTF-8 decoder. I realized we only ever need to match the U+0080-U+FFFF range and added manual checks, which are faster, together with tests that compare the whole range against what PCRE matches.
  • Simplified and optimized many places with strspn and strcspn in the lexer.
  • Significantly simplified the lexer state. Basically, we now only change $this->bytes_already_read. Also removed the "channel", etc.
  • Tokenizing methods in the lexer now have unified naming and API.
  • A lot of docs added (but more needed still).
  • Implemented integer type detection in the lexer (int vs. long, etc.).
  • Cleanup, naming, better docs in the parser, but also a lot of TODO comments in these classes.
  • All lexer and parser tests are now run in CI via PHPUnit.
  • Polished and documented all the scripts.
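As a rough illustration of the first two points (not the actual lexer code — the function name and exact bounds checks here are just a sketch):

```php
<?php
// In UTF-8, code points U+0080–U+FFFF are encoded as 2- or 3-byte
// sequences, so for valid input, inspecting the lead byte is enough to
// decide membership in that range — no PCRE or UTF-8 decoder needed.
function is_lead_byte_in_identifier_range( $byte ) {
	$ord = ord( $byte );
	// 0xC2–0xDF lead 2-byte sequences (U+0080–U+07FF);
	// 0xE0–0xEF lead 3-byte sequences (U+0800–U+FFFF).
	return $ord >= 0xC2 && $ord <= 0xEF;
}

// strspn() lets the lexer consume a whole run of digits in one call,
// without a regex:
$sql          = '12345abc';
$digit_length = strspn( $sql, '0123456789' ); // 5
```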

There is still more to be done, better documented, and polished, but I think now it would make sense to do that in another pass. I added TODOs for all issues I'm aware of, including naming, missing documentation, etc.

I understand we're testing for PHP errors and parsing failures. How do we know the produced parse tree is correct? Could we also have unit tests for that? If there's no easy way to do it for the entire test set, can we do it for a limited subset of the most tricky queries?

If we run a development version of MySQL, we can use the SHOW PARSE_TREE statement to inspect what parse tree MySQL produces. I haven't investigated yet to what extent we could really use it at scale (e.g., whether we can somehow map it to our ASTs), but it may be something worth exploring.

Otherwise, I think we may also need to create a test suite of expected ASTs manually, and then gradually expand it.

JanJakes avatar Nov 12 '24 15:11 JanJakes

@JanJakes marvelous work ❤️ This looks really good and I don't see anything blocking. Minor nitpicks aside, we're good to merge – thank you so much for your hard work and perseverance here!

adamziel avatar Nov 13 '24 12:11 adamziel

I can't approve my own PR, but I can merge it. Just say when and I'll click the button 🎉

adamziel avatar Nov 13 '24 12:11 adamziel

@adamziel Thanks! Addressed all but https://github.com/WordPress/sqlite-database-integration/pull/157#discussion_r1840197815. When I'm sure I know what we want here, I can fix that too.

JanJakes avatar Nov 13 '24 15:11 JanJakes

This PR is in a great place, monumental work here @JanJakes!

adamziel avatar Nov 18 '24 11:11 adamziel