AST-based SQLite driver

JanJakes opened this pull request 1 year ago • 18 comments

This PR is part of the Advanced MySQL support project and builds on the previous work in Exhaustive MySQL parser.

This is a WIP, currently stacked on top of https://github.com/WordPress/sqlite-database-integration/pull/157.

JanJakes avatar Nov 14 '24 14:11 JanJakes

@adamziel Looking deeper into this, I think that translating specific AST subnodes is pretty straightforward, but the driver/translator architecture remains an open question, and I keep going back and forth between multiple options.

Splitting the driver & translator

I'm not sure whether having everything in one large class is a good idea, and I'm exploring ways to split the code a bit. That said, there is the large switch/case statement in the prototype that doesn't need to know much about the AST structure — it handles some nodes specifically and recursively passes many others back down to the translator. This has some advantages, such as not needing to worry much about where a specific grammar rule can appear.
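To make the shape of that concrete, a minimal sketch of the recursive switch could look like this (the rule_name and value accessors are assumptions about the node API, not the actual one):

private function translate( $node ) {
	if ( ! $node instanceof WP_Parser_Node ) {
		// A leaf token; most tokens can pass through to SQLite unchanged.
		return $node->value;
	}

	switch ( $node->rule_name ) {
		case 'systemVariable':
			// Nodes that differ between MySQL and SQLite get specific handling.
			return $this->translate_system_variable( $node );
		default:
			// Everything else recurses into its children, so the switch does
			// not need to know where in the grammar this rule can appear.
			$parts = array();
			foreach ( $node->get_children() as $child ) {
				$parts[] = $this->translate( $child );
			}
			return implode( ' ', $parts );
	}
}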

Splitting top-level constructs

That said, some kind of distinction makes a lot of sense to me, at least at the highest level. There is a big difference between running a SELECT statement, an UPDATE statement, a CREATE TABLE statement, etc. These top-level statements have quite different semantics, they can return different things, and some special constructs make sense only in certain cases (e.g., SQL_CALC_FOUND_ROWS works only with SELECT statements). Therefore, I'm fairly confident that at the top level, it makes sense to split these use cases into separate classes.

To add a bit more context: even though the top-level statement rule can be reused in a compound statement for stored procedures, stored functions, triggers, and events, it's probably good to distinguish that case from the top-level use case. E.g., an ALTER TABLE ... statement at the top level may result in directly executing multiple SQLite queries, while when the same statement appears in a procedure body, we may need to report it as unsupported.

Splitting further?

However, going a little deeper, any kind of distinction can be problematic. There are surely some grammar rules that can appear only in the context of others — tableElementList can never appear in the SELECT clause, for instance — but would it make any sense to implement such rules in a CREATE/UPDATE translator class? Or should there be one huge switch to handle all the grammar rules, with perhaps the exception of the top-level ones, as suggested above?

Translating vs. interpreting

Another way of looking at organizing the code is the difference between translating MySQL syntax into SQLite syntax vs. emulating a MySQL statement with multiple different SQLite statements or PHP calls. We can't completely separate these two, but I think it would be nice to be clear about where we are just translating, and where some logic is actually executed. Again, it seems to me that the top-level vs. deeper-level distinction could help here — each top-level statement could be handled by a "handler" class that would interpret the query in full, and these handlers could use a generic "translator" within them, whose only task would be to translate some MySQL constructs into SQLite syntax. For example:

class WP_SQLite_Create_Table_Translator {
	public function translate_create_table( WP_Parser_Node $ast ) {
		// CREATE TABLE ... LIKE ...
		// CREATE TABLE ... (LIKE ...)
		$has_like = $ast->get_descendant_token( WP_MySQL_Lexer::LIKE_SYMBOL );
		if ( $has_like ) {
			// Here we will need to run a query to get the table structure,
			// and then create the table:
			return $this->create_table_like( $ast );
		}

		// CREATE TABLE ... ( elements ) options
		// CREATE TABLE ... AS ...
		$elements_ast = $ast->get_child( 'tableElementList' );
		$has_as       = $ast->get_descendant_token( WP_MySQL_Lexer::AS_SYMBOL );
		if ( $elements_ast && $has_as ) {
			throw new \Exception( '"CREATE TABLE" with both column/constraint definition and "AS" clause is not supported' );
		}

		// Option A: SQLite supports both ( elements ) and "AS", so we can just
		// rewrite the query and run it. Rewriting can be done either partly here,
		// or fully in the translator.
		$query = $this->translator->translate( $ast );
		return $this->connection->query( $query );

		// Option B: alternatively, we rewrite the top-level `CREATE TABLE` here
		// and nested nodes in the translator:
		$query = new WP_SQLite_Expression(
			array_merge(
				array( WP_SQLite_Token_Factory::raw( 'CREATE TABLE' ) ),
				array(
					$this->translate(
						$ast->get_child( 'tableName' )->get_child()
					),
				),
				array( WP_SQLite_Token_Factory::raw( '(' ) ),
				$this->translator->translate( $ast->get_child( 'tableElementList' ) ),
				array( WP_SQLite_Token_Factory::raw( ')' ) )
			)
		);
		return $this->connection->query( $query );
	}
}

I'm not yet sure what the optimal solution is, but it feels like at least some separation of concerns would be useful. It would also be nice if we could solidify the AST node translation by unit testing the translated fragments as well (not only via the end-to-end tests that run against SQLite), as such tests would give immediate feedback and great diffs.
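For illustration, such a fragment-level test could look roughly like this (the assertQuery helper and the parser/driver wiring are hypothetical, not an existing API):

use PHPUnit\Framework\TestCase;

class WP_SQLite_Translation_Tests extends TestCase {
	public function test_integer_primary_key_translation() {
		// The expected SQLite output is asserted as a string, so the test
		// gives immediate feedback and a readable diff without running SQLite.
		$this->assertQuery(
			'CREATE TABLE t ( id INTEGER PRIMARY KEY AUTOINCREMENT )',
			'CREATE TABLE t (id INT PRIMARY KEY AUTO_INCREMENT)'
		);
	}

	private function assertQuery( string $expected_sqlite, string $mysql_query ): void {
		$ast = $this->parser->parse( $mysql_query ); // hypothetical wiring
		$this->assertSame( $expected_sqlite, $this->driver->translate( $ast ) );
	}
}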

JanJakes avatar Nov 15 '24 14:11 JanJakes

Is there any downside to starting with a single giant class and splitting once that no longer serves the purpose?

Separation of concerns is a useful principle but it cuts both ways. Take Enterprise FizzBuzz. Great separation of concerns, extremely unreadable codebase. Too many concerns were defined and separated.

Therefore, I propose starting with just a single concern: Running MySQL queries in SQLite. Why? Because we don't really need modular code here. I don't think we'll ever need to use, say, SELECT statement translation as a component in any other context – so is it really useful to separate it?

I only propose that approach as a starting point. Perhaps it will make things unnecessarily complex and we'll have to look for other ways to slice it. I'm just not too excited about preemptively decoupling things that may not need decoupling at all.

The HTML Tag Processor, for example, is just a single class even though it tackles a complex problem. It's a huge advantage – most methods are private so we can adjust them without breaking BC, you never have to analyze a dependency graph or wonder where a specific piece of logic should go, you always have full access to the entire state, and you can extend it easily.

I've tried splitting that logic a few times in the current iteration of the plugin, and the outcome was always messy. Making a distinction between translating and interpreting is tempting, but it complicates the design. Perhaps things are different now that we have an AST – if they are, we should hit the wall with the single-class approach soon enough.

adamziel avatar Nov 15 '24 18:11 adamziel

@adamziel Thanks for your insights! I did not consider that separation of concerns also creates new public APIs as a side effect. That's a very important point. Your proposal makes a lot of sense to me. Somehow, it seemed natural to me that the HTML Tag Processor or the MySQL lexer, for instance, is handled by a single class, but I wasn't sure about the driver. I guess for parser-like logic, a single huge class is quite common, but in the case of the driver, I started to see too many different things (DB connection, query translation, query execution, etc.), and got caught overthinking it.

Is there any downside to starting with a single giant class and splitting once that no longer serves the purpose?

The only downside is that the giant class will have multiple responsibilities, and it can get a bit difficult to read. For instance, some state variables will make sense for SELECTs, but not for INSERTs, etc. But we can mitigate that with good naming, docs, and keeping the state as simple as possible.

Separation of concerns is a useful principle but it cuts both ways. Take Enterprise FizzBuzz. Great separation of concerns, extremely unreadable codebase. Too many concerns were defined and separated.

Wow, what a codebase! 😂 And a nice demonstration of what over-architecting can cause.

Therefore, I propose starting with just a single concern: Running MySQL queries in SQLite.

💯

I only propose that approach as a starting point. Perhaps it will make things unnecessarily complex and we'll have to look for other ways to slice it. I'm just not too excited about preemptively decoupling things that may not need decoupling at all.

Makes sense!

It's a huge advantage – most methods are private so we can adjust them without breaking BC

This! 💯


Looking at the architecture, I'd also like to simplify the current draft with WP_SQLite_Expression and WP_SQLite_Query_Builder. I think we could make it work without them.

The WP_SQLite_Query_Builder takes the expression and stringifies it, joining all strings with a space and wrapping sub-expressions in parentheses. It covers many use cases out of the box, but it's not really a definition of the SQLite SQL dialect or anything like that, and it's not very flexible — for instance, a list of column definitions is delimited by a comma, not a space, and the current implementation would probably put spaces even in places where they aren't expected. Apart from that, it performs some operations on tokens (quoting identifiers, for instance) that could be done by the driver itself.

For WP_SQLite_Expression, it may be possible to replace it within the driver with the return values of the recursive translate calls — either we return a token, or we could return an array (~ the expression). In any case, if we need to keep it for some reason, I'd consider renaming it to something like WP_SQLite_Node to avoid confusion with SQL language expressions or any hint that it maps to some SQLite construct.

I will quickly try out these simplifications to see if they could make sense.

JanJakes avatar Nov 18 '24 09:11 JanJakes

@adamziel Another thought regarding WP_Parser_Node: We now provide child and descendant methods. Should we also provide parent methods? They can be useful, but I think they have a major flaw. I'll explain:

Sometimes it could be useful in the driver to translate some constructs in the manner of "if I'm in the context of createTable, do this, otherwise do that". This is a simple way to express some specific nuances, but it's not the only way — we could also handle it from the top, within the createTable node, constructing the statement a bit more top-down (though that can be more verbose at times).

That said, I think there's a problem — adding parent references to WP_Parser_Node would put pressure on memory. PHP's garbage collection is based on reference counting, which frees an object as soon as there are no more references to it, but circular parent-child references would keep whole subtrees alive until the cycle collector kicks in. Weak references were only introduced in PHP 7.4, so I guess it currently doesn't make sense to use them. Parent references also complicate the tree a bit and make subtrees not "independent". There is the alternative of keeping some context (a node stack) in the driver during translation, for instance.
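For reference, here is roughly what the weak-reference variant could look like on PHP 7.4+, using a hypothetical node class (not the actual WP_Parser_Node API):

class WP_Parser_Node_With_Parent {
	/** @var WeakReference|null Weak upward link that doesn't keep the parent alive. */
	private $parent_ref = null;

	public function set_parent( self $parent ): void {
		$this->parent_ref = WeakReference::create( $parent );
	}

	public function get_parent(): ?self {
		// Returns null once the parent has been garbage-collected.
		return null !== $this->parent_ref ? $this->parent_ref->get() : null;
	}
}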

JanJakes avatar Nov 22 '24 15:11 JanJakes

@JanJakes what would be a specific code example where a parent reference would help? I get "do this, do that," but what's "this" and "that"?

adamziel avatar Nov 23 '24 09:11 adamziel

Also, the HTML and XML APIs solve the same problem with get_breadcrumbs(), a method that gives you a stack of parent element names. Maybe the same would work here for AST nodes?

adamziel avatar Nov 23 '24 09:11 adamziel

@JanJakes what would be a specific code example where a parent reference would help? I get "do this, do that," but what's "this" and "that"?

@adamziel For now, I've run into the INTEGER PRIMARY KEY "without" vs. "with" autoincrement case — in the former, I'd like to translate it as INT PRIMARY KEY, while in the latter as INTEGER PRIMARY KEY, due to the "bug" that's preserved in SQLite for backward compatibility. In the AST, it would look something like this:

createStatement
   |- createTable
     |- tableElementList
       |- tableElement          <-- Here I can reliably determine PRIMARY KEY
       |  |- columnDefinition
       |    |- fieldDefinition  <-- Here I can reliably determine AUTOINCREMENT
       |     |- dataType        <-- Here I want to know about PRIMARY KEY and AUTOINCREMENT
       |     |- columnAttribute <-- Here PRIMARY KEY and AUTOINCREMENT can be specified
       |     ...
       |- tableElement
         |- tableConstraintDef  <-- But PRIMARY KEY can also be specified here (AUTOINCREMENT not)

Without the tableConstraintDef option for primary keys, it would be simple enough to handle this case in fieldDefinition in a top-down manner. Something like:

case 'fieldDefinition':
	$has_primary_key   = ...;
	$has_autoincrement = ...;
	if ( $has_primary_key ) {
		$children       = $ast->get_children();
		$data_type_node = array_shift( $children );
		$data_type      = $this->translate( $data_type_node );
		if ( 'INTEGER' === $data_type ) {
			$data_type = $has_autoincrement ? 'INTEGER' : 'INT';
		}
		return $data_type . $this->translate_sequence( $children );
	}
	return $this->translate_sequence( $ast->get_children() );

Here, $has_primary_key can peek below for a descendant token, but it has no way to check the tableConstraintDef that would be in a different tableElement. I would either need a reference to the parent tableElement to search it for a primary key, or I would need to save that context somehow ($this->set_context('primary_key', ...)) and then read it below.


However, in the case of CREATE TABLE, I'm also wondering whether it may make more sense to fully parse the create table statement (into columns, constraints, etc.), and only then construct a query from that, like we do in the current translator. It's granular changes on specific AST nodes vs reading the full MySQL statement and generating a new SQLite statement from that. I don't know yet which of these options will be ultimately more convenient and reliable.

JanJakes avatar Nov 23 '24 14:11 JanJakes

Two ideas to avoid a parent reference (a rough sketch follows below the list):

  • Could the driver or parser class keep an array like $child => $parent and expose a method such as get_parent_of?
  • The query execution process could keep track of the stack of nodes it entered
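A rough sketch of the first idea, keeping the parent links outside the tree so the nodes themselves stay cycle-free (the class name and eager indexing are assumptions):

class WP_Parser_Parent_Map {
	/** @var SplObjectStorage Maps each node object to its parent node. */
	private $parents;

	public function __construct( WP_Parser_Node $root ) {
		$this->parents = new SplObjectStorage();
		$this->index( $root );
	}

	public function get_parent_of( WP_Parser_Node $node ): ?WP_Parser_Node {
		return isset( $this->parents[ $node ] ) ? $this->parents[ $node ] : null;
	}

	private function index( WP_Parser_Node $node ): void {
		// One eager pass; the map could also be built lazily on first lookup.
		foreach ( $node->get_children() as $child ) {
			if ( $child instanceof WP_Parser_Node ) {
				$this->parents[ $child ] = $node;
				$this->index( $child );
			}
		}
	}
}

The second idea needs no extra structure at all: the translation routine pushes each node onto a stack when it enters it and pops it when done, so "where am I?" questions become a lookup in that stack.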

adamziel avatar Nov 23 '24 16:11 adamziel

It's granular changes on specific AST nodes vs reading the full MySQL statement and generating a new SQLite statement from that. I don't know yet which of these options will be ultimately more convenient and reliable.

Are these actually different? You need to traverse the tree node by node in either case. Or would the latter mean doing one pass to build a flat index of, say, modifiers and options, and then doing the translation in the second pass? Because I don't think that index could be flat - we'd still need a recursive data structure to account for subqueries.

adamziel avatar Nov 23 '24 16:11 adamziel

It's granular changes on specific AST nodes vs reading the full MySQL statement and generating a new SQLite statement from that. I don't know yet which of these options will be ultimately more convenient and reliable.

Are these actually different? You need to traverse the tree node by node in either case. Or would the latter mean doing one pass to build a flat index of, say, modifiers and options, and then doing the translation in the second pass? Because I don't think that index could be flat - we'd still need a recursive data structure to account for subqueries.

@adamziel I think they are a bit different:

  1. The first approach, with only the necessary granular changes, often relies on the default case that just recursively translates all children and concatenates them with a space. At the moment, this often works thanks to the similarity of the two SQL dialects, so even tokens are reused (SELECT, parentheses, etc.). In a way, we're just taking the MySQL query and making some adjustments where needed. Over time, we'll probably cover the translation more and more in each node, and rely a bit less on the shape of the original MySQL query.
  2. The second approach would be oriented more towards reading the data and then constructing a whole new query, rather than making only the necessary changes. The index of the elements could be flat, at least in this case. More generally, even with this approach, recursion is still possible — if I know that I have a list of some definitions, I can still get them with $definition = $this->translate( ... ), in case they can contain expressions, subqueries, etc.

Looking at CREATE TABLE in particular, and the grammar branch that contains tableElementList (the last below),

createTable:
    TEMPORARY_SYMBOL? TABLE_SYMBOL ifNotExists? tableName (
        LIKE_SYMBOL tableRef
        | OPEN_PAR_SYMBOL LIKE_SYMBOL tableRef CLOSE_PAR_SYMBOL
        | (OPEN_PAR_SYMBOL tableElementList CLOSE_PAR_SYMBOL)? createTableOptions? partitionClause? duplicateAsQueryExpression?
    )
;

we would be able to get a list of all column and constraint definitions, organize them somehow, and then generate the new CREATE TABLE query. The question is whether we should do that, or if we should just go with option 1.


Here's maybe a good example for thinking about this:

A PRIMARY KEY constraint can be defined either within a particular column definition, or in a different table element, separately from the column definition. Now:

  1. With option 1, we'd keep the definition where it was in MySQL, not normalizing it in any way, relying on the fact that both MySQL and SQLite support defining PRIMARY KEY in both places. If we ran into a specific constraint that can be defined in both places in MySQL, but only in a single one in SQLite, we'd just handle that specific case.
  2. With option 2, we would first create a list of all columns and constraints, and then generate a query from that. Inline constraints would be normalized to separate constraint definitions (or the other way round), etc.

As a consequence, option 1 fixes only the particular cases that SQLite supports differently, leaving the rest mostly to their similarities. If there is a specific keyword or modifier we need to ignore, it needs to be done explicitly.

With option 2, we would be picking and allow-listing the things we support, possibly ignoring everything else. In many cases, I think option 2 may be a bit more verbose, unless option 1 turns out to require handling many exceptions and edge cases.

Ultimately, we can do a little bit of both — sometimes we can be more descriptive for a specific node, decomposing it and fully reconstructing it in an option 2 fashion, while still leveraging option 1 by translating some subnodes recursively. The decision may need to be made case by case; the question is what would be a useful guideline for deciding.


And here's a particular example:

In MySQL, a PRIMARY KEY with AUTO_INCREMENT can be defined in both of these ways:

a) CREATE TABLE t (id INT PRIMARY KEY AUTO_INCREMENT);
b) CREATE TABLE t (id INT AUTO_INCREMENT, PRIMARY KEY (id));

While in SQLite, b) is a syntax error. That means we have to rewrite it to something like a), because AUTOINCREMENT in SQLite can appear only in the column definition, and only after a PRIMARY KEY constraint. And here we're touching on the same questions — handling this particular case vs. reading and normalizing the full statement.


Another interesting example:

  • In MySQL, we can write both PRIMARY KEY AUTO_INCREMENT and AUTO_INCREMENT PRIMARY KEY.
  • In SQLite, it always needs to be PRIMARY KEY AUTOINCREMENT; the reverse order is a syntax error.

Going back to the two options, with option 1, we need to catch and handle this case, while with option 2, it would probably just work due to the full statement reading and normalization, at the likely cost of more verbosity.

Ultimately, it may also work that we only normalize the PRIMARY KEY & AUTOINCREMENT definitions, leaving the rest as-is. At the moment, I don't know yet if there will be more of such cases in this particular statement.
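To illustrate the scale of such a targeted normalization, an option 1 fix for the ordering issue could be as small as this sketch (the has_* helpers are hypothetical):

// Emit the attributes in the order SQLite requires, regardless of the
// order they had in the MySQL input.
private function translate_field_attributes( WP_Parser_Node $field_definition ): string {
	$fragments = array();
	if ( $this->has_primary_key( $field_definition ) ) {
		$fragments[] = 'PRIMARY KEY';
		if ( $this->has_autoincrement( $field_definition ) ) {
			// In SQLite, AUTOINCREMENT is only valid immediately after PRIMARY KEY.
			$fragments[] = 'AUTOINCREMENT';
		}
	}
	// ... translate the remaining attributes (NOT NULL, DEFAULT, etc.) ...
	return implode( ' ', $fragments );
}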

JanJakes avatar Nov 25 '24 09:11 JanJakes

Fantastic explanation @JanJakes, thank you so much for making the distinction so clear!

I don't know the answer. We'll have to discover it through exploration. In my attempts, walking over the AST and manipulating the resulting query seemed to be quite handy.

Translating the MySQL AST into another intermediate data structure seemed cumbersome, but perhaps iterating over the AST and storing metadata would be helpful, as in $table_ast_node['metadata']['primary_key'] = ['column 1', 'column 2']. But perhaps a computation would be just as good, as in $primary_key = $this->find_primary_key_columns($table_ast_node).

Either way, it seems like we're in multi-pass land, where at least one pass is needed to scan the data and perhaps construct a query scaffold, and another pass is needed to finalize the query construction. With a computation-based approach such as $this->get_some_info($ast_node), the two passes would use co-located logic, but we'd also do more work than necessary. With a metadata-tracking approach, such as $this->pk[] = $node['column_name'], we'd do two explicit passes but also distribute parts of logic that are tightly related. Tradeoffs, tradeoffs.

It would be useful to branch out, start a sub-pr for each idea, and discuss the ups and downs.

adamziel avatar Nov 25 '24 15:11 adamziel

@adamziel Looking closer, I think it also depends on the AST-querying capabilities. If we could be more specific (e.g., querying for tableElementList > tableElement > tableConstraintDef > "PRIMARY"), that could be safer and better optimized than a full descendant search. That said, this still doesn't give me "the node contains the PRIMARY keyword", and I don't have a parent reference.

When looking at the "storing metadata" vs "doing a computation" problem, I realize that storing metadata can involve a computation as well, and in the particular case of the PRIMARY KEY constraint, both options will actually involve a pretty similar one. To get the metadata, we need to look into the tableElementList subtree first, as we don't know whether the primary key constraint is defined before or after the column specification. The issue with computation-only would likely be that we would call it for every column again and again, while the metadata approach would "cache" it.

What I want to say is that if we have powerful and performant AST-querying capabilities, we could probably always do computations. Then it's a question whether such capabilities belong to WP_Parser_Node or rather something external to it (that could support even parent lookups):

$query = new WP_Parser_Query( $ast );
$query->selector( $node, ['tableElementList', 'tableElement', 'tableConstraintDef', ...] );
$query->selector_all( ... );
$query->parent( $node );
// etc.

However, this would likely mean doing a full AST pass first to build some lookup tables, or we'd need to build them lazily while querying. Also, I don't want to overengineer it, and this could go in that direction. At the same time, having better AST search capabilities at hand would be great, and they would work for any other grammar as well.

Anyway, I guess I'm throwing out here too many implementation details, but it does help to write things down to get the thoughts more organized and have a better overall view.

JanJakes avatar Nov 26 '24 13:11 JanJakes

@adamziel I took a stab at this in https://github.com/WordPress/sqlite-database-integration/pull/164/commits/797c3b78f4035f9da4262be75f55190e989a92a5. It combines a bit of top-down (execute_create_table_statement) with the recursive approach. It's not overly verbose, but the downside is that some problems are handled in multiple different nodes (the PRIMARY KEY & AUTOINCREMENT issue). It also handles some other MySQL vs. SQLite CREATE TABLE specifics (see the comment in execute_create_table_statement).

Saving metadata into a state wouldn't change that much here, I suppose — at least not the main architecture of having to do some logic close to the top level and other logic deeper down. What would change it significantly is the full-parsing approach, but as we discussed, that may not be great either, and it would certainly be much more verbose. We would need methods such as parse_create_table, parse_mysql_create_table_field, parse_mysql_create_table_constraint, etc.

I could try to go further with this approach, focusing on more complex things like ALTER TABLE, and see where it takes us. What do you think?

JanJakes avatar Nov 26 '24 17:11 JanJakes

After exploring the rabbit hole of emulating the ALTER TABLE statements that are not supported by SQLite, I brainstormed with @adamziel about possible ways to address this, and we came up with the following idea:

Let's implement a subset of MySQL's INFORMATION_SCHEMA tables.

This will provide us with the foundation for supporting these constructs:

  • Some ALTER TABLE ... CHANGE ..., ALTER TABLE ... MODIFY ..., and other ALTER ... statements.
  • Some DESCRIBE statements.
  • SHOW CREATE TABLE statement.
  • Querying the INFORMATION_SCHEMA tables.
  • Likely more that I'm not aware of at the moment.

This approach will serve multiple different use cases while implementing the support for INFORMATION_SCHEMA queries at the same time (https://github.com/WordPress/sqlite-database-integration/issues/78, https://github.com/WordPress/sqlite-database-integration/issues/146).

JanJakes avatar Dec 03 '24 15:12 JanJakes

@JanJakes lovely work here, and I'm really happy about how the information schema is coming together. What's the delta between here and passing all the unit tests?

adamziel avatar Dec 10 '24 16:12 adamziel

What's the delta between here and passing all the unit tests?

@adamziel I think the hardest parts are the CREATE TABLE, ALTER TABLE, and some of the SHOW/DESCRIBE statements. In other words, having the information schema will unlock many of the most complex features, and with the decent information schema coverage we have at the moment, I think we can start exploring that. The implementation of many of the other missing features could be ported from the legacy translator, and some of them are already in your prototype — SQL functions, SQL_CALC_FOUND_ROWS, FROM DUAL, etc.

Now I want to look into the information_schema -> SQLite CREATE TABLE conversion, which is needed for ALTER statements that have to copy the table. Having that, I think we can simply use it for CREATE TABLE statements as well. Then we'll also need an information_schema -> MySQL CREATE TABLE conversion for SHOW statements, etc.

There is still quite a bit of work to do, but with the hardest stuff getting done first, I think the number of failing tests will decrease faster and faster. In the end, I think that when we pass the existing test suite, we'll immediately have higher MySQL compatibility, due to the nature of the new AST & information schema approach.

JanJakes avatar Dec 11 '24 10:12 JanJakes

🎄 Update

I'm finally here with an update, sorry for being overdue. It's a longer read that probably should've been split into multiple posts.

I'll start with a summary:

  1. 61% (71/116) of the original tests are passing. Before starting the INFORMATION_SCHEMA rewrite, it was 33% (38/116).
  2. A significant portion of some of the hardest statements is done — CREATE TABLE and ALTER TABLE operations.
  3. The basic forms of SHOW and DESCRIBE using the INFORMATION_SCHEMA are implemented.
  4. However, I think there's also still quite a bit of work left to get to 100%, especially to get everything correct and stable.
  5. The remaining failing tests are:
    • String matching and regexp support (LIKE, RLIKE, etc.) — 8 tests.
    • Date/time related columns — 5 tests.
    • SQL_CALC_FOUND_ROWS — 4 tests.
    • LEFT function — 4 tests.
    • SHOW TABLES with LIKE and FROM — 4 tests.
    • ON CONFLICT ... — 3 tests.
    • SELECT FROM DUAL — 2 tests.
    • Other things like CREATE TEMPORARY TABLE, HAVING COUNT variations, CURRENT_TIMESTAMP syntax, ON UPDATE, CAST as binary, etc.
  6. For more details, check out the WP_SQLite_Information_Schema_Builder.

The full list of failures as of today can be found here: https://gist.github.com/JanJakes/23e36446b24c88bee7ca81415d9f6059


Now, I'd like to share some more details about the progress, and the rabbit holes I encountered along the way.

Simple statements -> Basic DDL -> Information schema

A couple of weeks ago, I started to implement the new SQLite driver to translate MySQL statements to SQLite statements, using the new MySQL parser. I implemented the basics of a few simple statements, such as SELECT, INSERT, UPDATE, REPLACE, and DELETE, and then I switched my focus to DDL statements, such as CREATE TABLE and ALTER TABLE.

This turned out to bring some complexities, and after further discussions, and realizing that a pure MySQL -> SQLite DDL conversion is lossy, we decided to implement a subset of MySQL's INFORMATION_SCHEMA tables.

Implementing INFORMATION_SCHEMA tables

First, I decided to focus on the schema tables that are necessary to describe tables, columns, indexes, and constraints. I started by implementing the CREATE TABLE statement. This is essentially an AST-to-information-schema-query conversion. At its core, the logic is simple, but reaching high coverage and correctness turned out to be pretty extensive work.

Here are some interesting issues I encountered:

  • Some columns depend on each other. Sometimes a charset determines a collation (we need a full map of default collations for each charset), and sometimes a collation determines a charset (here, it's just its prefix). Column lengths and numeric precisions depend on data types, charset (we need a map of charset to maximum bytes per character), etc. A sketch of such lookup tables follows below the list.
  • Some tables depend on each other. Adding or dropping a constraint may affect the nullability and some other attributes of a column as well, and this data needs to be synchronized. Maybe in the future, we could reduce the number of actual information schema tables, and use views to present the data (MySQL has a similar approach from version 8).
  • Some operations have non-trivial effects on the schema. Dropping a column will remove it from existing indexes, meaning that multi-column index records may need to be renumbered, and other column attributes may change.
  • Some information schema values require a lot of normalization. There are so many ways to write varchars, national varchars, and text fields (char varying, character varying, varchar, national varchar, nvarchar, nchar varchar, national char varying, nchar varying, long char varying, long character varying, long varchar, ...), and some keywords can change their meaning (long converts varchar to text, appending CHARSET BINARY seems to change varchar to varbinary, etc.).
  • For some values, we may need to implement an AST -> MySQL string conversion. More on that below.
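To illustrate the charset/collation interdependencies from the first point, the lookup tables could look roughly like this (the names are hypothetical; the values shown are pre-8.0 MySQL defaults):

// Hypothetical lookup tables; a real implementation would need to cover
// every charset MySQL supports.
const DEFAULT_COLLATION_FOR_CHARSET = array(
	'latin1'  => 'latin1_swedish_ci',
	'utf8mb4' => 'utf8mb4_general_ci',
	'binary'  => 'binary',
);

const MAX_BYTES_PER_CHARACTER = array(
	'latin1'  => 1,
	'utf8'    => 3,
	'utf8mb4' => 4,
);

// A collation determines its charset by prefix: "utf8mb4_unicode_ci" -> "utf8mb4".
function charset_for_collation( string $collation ): string {
	$underscore = strpos( $collation, '_' );
	return false === $underscore ? $collation : substr( $collation, 0, $underscore );
}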

I've addressed or TODO-ed all the issues I've encountered, but more extensive test coverage for some of them is still needed. More details can be seen in WP_SQLite_Information_Schema_Builder, which implements all of the AST -> information schema logic.

MySQL AST -> string conversion

As I was working on the information schema builder, I realized I not only need to read the MySQL AST, but in some cases, it's also necessary to convert some nodes back to strings. A simple example of this is reading a table name from a node that has multiple nesting levels below (identifier -> pureIdentifier -> IDENTIFIER). A more complex example is a column definition with a DEFAULT(expr), in which case I need to save the serialized expression in the INFORMATION_SCHEMA.

I think there are multiple approaches we could consider:

  1. Maybe just walking the tree and joining all tokens with spaces would work.
  2. We could store the original whitespaces in the AST, and then join the values.
  3. We could store offsets in the original SQL input for each node, and then just cut out what we need.
  4. We could implement an AST -> SQL string printer, maybe a bit more involved than option 1.

Additionally, I often need a bit more than the original string. For identifiers, the value alone is not enough — I need to unquote it first (replace escaped sequences with their characters, trim the quotes), and it's a question where such logic belongs.
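The unquoting itself is small; roughly this, assuming backtick-quoted MySQL identifiers (a sketch, not the actual helper):

// Strip the surrounding backticks and replace the escaped `` sequence
// with a single backtick, e.g. `my``table` -> my`table.
function unquote_mysql_identifier( string $quoted ): string {
	if ( strlen( $quoted ) >= 2 && '`' === $quoted[0] && '`' === substr( $quoted, -1 ) ) {
		return str_replace( '``', '`', substr( $quoted, 1, -1 ) );
	}
	return $quoted; // Already an unquoted identifier.
}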

ALTER TABLE

The next topic after storing the CREATE TABLE information in the INFORMATION_SCHEMA was implementing support for the most common table-altering operations. For that, I refactored my initial draft into reusable methods, and realized that ALTER TABLE statements bring another set of challenges.

SQLite has limited support for ALTER TABLE statements, so I started by trying to map the statements that SQLite can handle to their SQLite counterparts, while falling back to a table-copying approach for the others. This turned out to be a rabbit hole:

  1. ALTER TABLE statements can contain multiple operations, and they often do. Therefore, I first needed to decide whether I had to copy the table, or whether I would be able to map a single MySQL statement to multiple SQLite ALTER TABLE statements.
  2. Determining which MySQL ALTER TABLE statements are directly supported by SQLite is not straightforward. It may seem that an operation, such as ADD COLUMN is simply supported, but the reality is that it might not be in all cases (e.g., with FIRST/AFTER ...).
  3. Implementing the "direct" plus "fallback to full copy" approach means doing many things twice.

For these reasons and after some experimenting, I decided to first focus on implementing the full table-copy ALTER TABLE algorithm, leaving the direct translation of some operations as an optimization for later. The approach is pretty straightforward:

  1. Reflect all ALTER TABLE operations in the INFORMATION_SCHEMA.
  2. Run the SQLite ALTER TABLE algorithm with full table copy as per the SQLite docs.

This turned out to work flawlessly and to be very helpful in the initial implementation phase. Implementing a new ALTER TABLE operation is currently as simple as reflecting its changes in the information schema. This is great for broadening statement support before focusing on optimizing specific statements.

Finally, the ALTER TABLE algorithm recommended in the SQLite docs requires only one table copy, so the new driver doesn't do two copies like the old one did.
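In outline, that algorithm maps to a sequence like the following sketch (the generate_* helpers, driven by the information schema, are hypothetical; index/trigger recreation and the foreign key check are elided):

// Single-copy table recreation, following the procedure from
// https://www.sqlite.org/lang_altertable.html#otheralter
private function recreate_table_from_information_schema( string $table ): void {
	$this->connection->query( 'BEGIN' );

	// 1. Create the new table under a temporary name, using the
	//    already-updated information schema as the source of truth.
	$this->connection->query( $this->generate_sqlite_create_table( "_tmp__{$table}" ) );

	// 2. Copy the data across in a single statement.
	$columns = $this->generate_shared_column_list( $table );
	$this->connection->query(
		"INSERT INTO \"_tmp__{$table}\" ({$columns}) SELECT {$columns} FROM \"{$table}\""
	);

	// 3. Drop the old table and move the new one into its place.
	$this->connection->query( "DROP TABLE \"{$table}\"" );
	$this->connection->query( "ALTER TABLE \"_tmp__{$table}\" RENAME TO \"{$table}\"" );

	$this->connection->query( 'COMMIT' );
}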

Connecting the pieces

What turned out to be challenging in this first phase of implementing the INFORMATION_SCHEMA support was that until I got to some level of completeness, I wasn't able to see any progress in terms of passing tests, and it wasn't easy to estimate where exactly we stood regarding compatibility with the existing SQLite translator.

This is because:

  1. The actual database operations can't be implemented before the data is stored in INFORMATION_SCHEMA.
  2. Saving only some data in the INFORMATION_SCHEMA means the tests will fail because the data is incomplete.
  3. Implementing the INFORMATION_SCHEMA builder and the CREATE/ALTER TABLE operations is not enough either. Most of the tests verify the results using SHOW and DESCRIBE statements.

Therefore, while working on the INFORMATION_SCHEMA, I didn't see any progress on the original SQL translation test suite (though I did write a new set of tests along the way). But the reward comes at the end — quickly prototyping SHOW and DESCRIBE statements with the INFORMATION_SCHEMA at hand turned out to be easy and straightforward, and suddenly many tests started to pass, while for some others, this revealed incorrect assertions, as the old driver doesn't always have 100% correct data at hand.

Next steps

There is still a good amount of work left to get to 100% of the test suite for the current SQLite translator, but I think the current state provides a good foundation for many of the hard parts, and building on it will hopefully be easier. On top of that, there are some MySQL features that can likely be ported from the old translator without much of a rewrite.

Additionally, the existing test suite may not fully cover everything that the current translator implements, so I think it would be good to look at multiple measures:

  1. The percentage of passing tests in the existing test suite.
  2. Checking the existing translator method by method and ensuring we implement all the operations.
  3. Running WordPress with the new driver.
  4. Examining the new driver using Playground Tester.

That said, maybe the path to get this merged, e.g., with a feature flag, is to focus on point 1, and do the rest in other PRs.

JanJakes avatar Dec 20 '24 15:12 JanJakes

Monumental work here @JanJakes, and a tremendous update. Mad props.

That said, maybe the path to get this merged, e.g., with a feature flag, is to focus on point 1, and do the rest in other PRs.

I would love that! It feels like a good spot to park, do a round or a few rounds of review, and merge. Let's just rewire the CI tests to the old driver to avoid failing checks on trunk.

adamziel avatar Dec 20 '24 17:12 adamziel

Superseded by #179

adamziel avatar May 30 '25 13:05 adamziel