`SERIALIZABLE` isolation may not respond promptly
What version of Materialize are you using?
main
How did you install Materialize?
Materialize Cloud
What is the issue?
The `SERIALIZABLE` isolation level can in principle respond immediately if there is a non-empty intersection of the `[since, upper)` intervals for its inputs.
However, we do not always do this. Currently, we require timestamps to be at least as large as the least `since` of all collections in the same "time domain", out of concern that we might be in the middle of a transaction; if so, we would feel bad bailing out of the transaction because the client subsequently queried a collection whose `since` was not less than or equal to the transaction timestamp.
I'm not sure how we resolve the above, but I think it is more a bug that `SERIALIZABLE` is not always responsive. If we want to add some additional hints about "prefer transaction robustness to response latency", that seems fine, but my sense is that we have backed into a corner that most users don't really care to be in.
Candidate acceptance criteria
When in `SERIALIZABLE` isolation and not in an explicit `BEGIN` transaction block, any `SELECT` query against inputs that have a non-empty `[since, upper)` intersection should be issued with a timestamp in that intersection, so that it can resolve without self-imposed timestamp-based latency.
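The criterion above amounts to intersecting the inputs' `[since, upper)` intervals and, when the intersection is non-empty, picking the query timestamp from it. A minimal sketch, assuming integer timestamps (the function name is illustrative, not Materialize's API):

```python
def frontier_intersection(intervals):
    """Intersect half-open [since, upper) intervals.

    Returns the non-empty intersection as (lo, hi), or None when the
    intervals are disjoint and no prompt timestamp exists.
    """
    lo = max(since for since, _upper in intervals)
    hi = min(upper for _since, upper in intervals)
    return (lo, hi) if lo < hi else None

# Overlapping frontiers: any timestamp in [5, 10) resolves promptly.
print(frontier_intersection([(0, 10), (5, 20)]))   # (5, 10)
# Disjoint frontiers: no timestamp satisfies every input immediately.
print(frontier_intersection([(0, 10), (20, 30)]))  # None
```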
Relevant log output
No response
#14717 closes out #14533, though we think that without fixing this issue, #14533 may still occur (although it is much less likely after #14717). I assert that this is more fuel for this issue: we can make the behavior of MZ more predictable by resolving it, including in cases when parts of the system go down (as in the test).
Consider adding "...while in a transaction" to acceptance criteria to handle multiple selects.
Joe: can we actually know that within the pg protocol? Matt believes so.
This behavior was changed in this PR: https://github.com/MaterializeInc/materialize/pull/13996. Previously, `SERIALIZABLE` isolation always responded promptly, but after that PR `SERIALIZABLE` queries have the possibility of blocking.
This issue provides some information on why the change was made: https://github.com/MaterializeInc/materialize/issues/14038. While that issue is related, it's somewhat orthogonal to this one and would involve different fixes. A flake in the c-sharp test suite initially triggered this work.
There is also a detailed comment in test/pgtest-mz/parse-started.pt that describes the issue.
The short story is that we had to make this change due to some corner cases of the PostgreSQL extended protocol. The protocol details can be found here: https://www.postgresql.org/docs/current/protocol-flow.html
Issue Cause
I want to expand on why objects are not always queryable in `SERIALIZABLE` isolation mode, in case it's not already obvious from the issue description. When in a read-only transaction, a client can issue a `SELECT` statement against any object within the same "time domain". The definition of time domain isn't that important for this issue and for all practical purposes can be replaced with "schema". So a read-only query can read from any object within the same schema.
The read timestamp of a read-only query is determined by the first read of the transaction and used for all subsequent reads until that transaction commits. Therefore, when determining a read timestamp in a read-only transaction, we need to make sure that the timestamp is valid for all objects in the schema, in case the client later queries another object.
So for example, imagine you have a schema `S` with two objects `A` and `B` that have the following `since`s and `upper`s:

| Object | since | upper |
|---|---|---|
| A | ts | ts + 10 |
| B | ts + 20 | ts + 30 |
So when we execute the following series of commands:

```sql
BEGIN;
SELECT * FROM A;
```
We will choose a timestamp of `ts + 20`, so that we can later read from `B`. Since `A` has an `upper` of `ts + 10`, this query will block until `A`'s `upper` advances past `ts + 20`. Additionally, read holds will be taken on `A` and `B` at `ts + 20` for the entire duration of the transaction, to prevent compaction of any object past `ts + 20`.
Ideally, if we just sent the single command:

```sql
SELECT * FROM A;
```

then, since we are in a single-statement transaction, we could choose a timestamp of `ts + 9` and not block at all. However, due to some quirks of the PostgreSQL protocol, it is sometimes ambiguous whether we are in a single-statement or multi-statement transaction (this is expanded on in the next section). Therefore, Materialize always assumes that we're in a multi-statement transaction and always looks at all objects in the schema when selecting a timestamp.
Relevant PostgreSQL Protocol Overview
See https://www.postgresql.org/docs/current/protocol-flow.html for the complete protocol.
Here I will give a brief overview of the relevant parts of the PostgreSQL protocol that can make it ambiguous whether we are in a single- or multi-statement transaction.
- PostgreSQL has two protocols, the Simple Query Protocol and the Extended Query Protocol.
- Every statement, regardless of protocol, is part of a transaction: either an explicit transaction or an implicit transaction.
Explicit Transaction
An explicit transaction starts with a `BEGIN` command and ends with a `COMMIT` or `ROLLBACK` command. All statements between these two commands are part of an explicit transaction. The majority of these transactions will contain multiple statements; otherwise the client generally wouldn't use an explicit transaction. In my opinion, it's fine to treat a single-statement explicit transaction as a multi-statement transaction, since the user is explicitly starting a transaction. Explicit transaction statements can also be spread over multiple network messages, so it's often impossible to tell during the first read whether the transaction will contain multiple statements.
Implicit Transaction
Implicit Transactions differ between the Simple and Extended protocol.
Simple protocol
The Simple protocol allows the user to batch multiple statements together into a single message, separated by semicolons. All statements in the same message that are not part of an explicit transaction are part of an implicit transaction (this is a bit of a simplification, but it's fine for understanding this issue).
So for example, in the following message:

```sql
SELECT * FROM t; INSERT INTO t VALUES (1); SELECT * FROM t;
```
All three statements are part of the same implicit transaction.
In the following example:
```sql
SELECT * FROM t; INSERT INTO t VALUES (1); SELECT * FROM 1/0;
```
1 is not inserted into the table because the final statement errors, which aborts the entire implicit transaction.
In the following example:
```sql
BEGIN; INSERT INTO t VALUES (1); COMMIT; INSERT INTO t VALUES (2); SELECT * FROM 1/0;
```
1 is inserted but 2 is not inserted. The first insert is part of an explicit transaction that commits successfully. The last two statements are part of an implicit transaction that doesn't commit successfully.
In the simple protocol, it's easy to analyze how many statements each implicit transaction contains, since they are all in a single message. Therefore, we should be able to optimize single-statement implicit transactions to never block.
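Because the whole batch arrives in one message, the coordinator could count statements up front and take the non-blocking path for single-statement batches. A toy sketch (naive semicolon splitting, which is wrong for semicolons inside string literals; a real implementation would reuse the SQL parser):

```python
def count_statements(batch: str) -> int:
    # Naive split on ';' -- good enough for these examples only.
    return len([s for s in batch.split(";") if s.strip()])

def single_statement_fast_path(batch: str) -> bool:
    """Single-statement implicit transactions can pick a timestamp from
    the one object being read, and therefore never block."""
    return count_statements(batch) == 1

print(single_statement_fast_path("SELECT * FROM t;"))              # True
print(single_statement_fast_path("SELECT * FROM t; SELECT 1;"))    # False
```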
Extended protocol
The extended protocol breaks up the execution of a single statement into multiple network messages. Sending the following messages in order is roughly equivalent to executing a single statement:

- `Parse`
- `Bind`
- portal `Describe`
- `Execute`
- `Close`

(Again, this is a simplification, but it's fine for understanding this issue.)
All statements that are not part of an explicit transaction are part of an implicit transaction. Sending a `Sync` message explicitly commits the currently running implicit transaction.
So the problem is: after executing some `SELECT` statement, we have no way of knowing whether the next message will be a `Sync` message, which commits the current implicit transaction, or another group of `Parse`, `Bind`, `Describe`, `Execute`, `Close` messages, extending the current implicit transaction to multiple statements.
Proposal
- All explicit transactions are treated as multi-statement transactions. When selecting a timestamp under `SERIALIZABLE`, we consider all objects within the same time domain (schema). Queries may end up blocking if there are objects with non-overlapping `[since, upper)` ranges.
- When selecting a timestamp for implicit transactions, we only look at the first object being read. If that timestamp is not valid for later reads, then we return an error with a helpful description and a hint saying that if you want to run multiple reads in a single transaction, you should use an explicit transaction.
- As a later optimization, for implicit transactions in the simple query protocol, we can look at only the subset of objects in the transaction, since all the information is known up front.
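The proposed implicit-transaction behavior could be sketched as follows (integer timestamps; the names and error message are illustrative assumptions, not the actual implementation):

```python
class TimestampNotValid(Exception):
    """A later read falls outside the timestamp chosen from the first read."""

def implicit_txn_timestamp(first_since, first_upper):
    # Proposal: consider only the first object read, so the query never
    # waits on other objects' uppers. Pick its newest complete timestamp.
    return first_upper - 1

def check_read(ts, since, upper):
    # A later read is valid only if the object covers the timestamp.
    if not (since <= ts < upper):
        raise TimestampNotValid(
            "object not readable at transaction timestamp; hint: use an "
            "explicit BEGIN ... COMMIT transaction for multi-statement reads")

# First read of A = [0, 10): ts = 9, no blocking.
ts = implicit_txn_timestamp(0, 10)
check_read(ts, 0, 10)  # the first read itself is always valid
# A later read of B = [20, 30) in the same implicit txn would raise.
```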
> When selecting a timestamp for implicit transactions, we only look at the first object being read. If that timestamp is not valid for later reads
How would this interact with compaction holding for nearby objects? Would we only hold back compaction for nearby objects whose `since` is <= the chosen timestamp?
The npgsql .net driver performs two selects over an implicit txn at its startup. Whatever solution we come up with must not ever cause the second select to fail, because no amount of user education will help there, since it's a driver issue, not a user issue. I would not be surprised if other drivers also do various statements in implicit transactions at start, and expect them to succeed.
If we want to implement this, I would encourage expanding our current smoke tests to include one or two more of the most popular languages so we don't accidentally prevent an entire language from using Materialize.
> How would this interact with compaction holding for nearby objects?

We could either not take any read holds, OR only take read holds on objects whose `since` is <= the chosen timestamp. I would lean towards not taking any read holds and, if this fails anyone's queries, pushing them towards explicit transactions. However,
> The npgsql .net driver performs two selects over an implicit txn at its startup.
That is very unfortunate and means that my proposal would not work.
Perhaps we should differentiate between the extended and simple protocols when choosing timestamps. In the simple protocol, if the implicit transaction only has a single statement, then we choose a timestamp that never blocks. However, in the extended protocol, we must treat every select as part of a transaction and select a timestamp that would be valid for the entire time domain.
Another option, as suggested by @frankmcsherry: we can have a session-level flag that indicates whether a user has tried and failed to run a multi-statement implicit transaction. If they have never tried and failed, then we select a timestamp considering only the first object in implicit transactions. If they have tried and failed previously, then we look at all objects in the time domain when selecting a timestamp. The idea being that we only penalize users who actually use multi-statement implicit transactions.
This would require drivers to retry any failed transactions, which I'm not sure they do.
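The session-flag idea could look roughly like this (a hypothetical sketch; the field and method names are invented, not from the codebase):

```python
class Session:
    """Tracks whether this session ever failed a multi-statement
    implicit transaction under the first-object-only policy."""

    def __init__(self):
        self.failed_multi_statement_implicit = False

    def timestamp_scope(self, implicit: bool) -> str:
        # Take the fast path until the session demonstrates that it
        # actually runs multi-statement implicit transactions.
        if implicit and not self.failed_multi_statement_implicit:
            return "first-object-only"
        return "whole-time-domain"

s = Session()
print(s.timestamp_scope(True))   # first-object-only (never failed yet)
s.failed_multi_statement_implicit = True
print(s.timestamp_scope(True))   # whole-time-domain (penalized now)
```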
> In the simple protocol, if the implicit transaction only has a single statement then we choose a timestamp that never blocks.
I believe the original intent of the `Started` txn state was to assert that exactly one statement was being run in the implicit txn. We have options like: teach the extended protocol to stop using `Started` and use `Implicit` instead, then (re-)teach `sequence_peek` to not care about nearby objects if the txn is in `Started`. That is: `Started` now always means exactly one statement, and we break from Postgres' txn naming conventions.
I did have another proposal from https://materializeinc.slack.com/archives/C02FWJ94HME/p1662774696784379?thread_ts=1662741366.405969&cid=C02FWJ94HME:
The problem with .net wasn't that there were two statements being executed at different timestamps (I think I was wrong about this above); it was that mz didn't hold back compaction for some of the things used in the second statement. So: we could auto-commit after all of these, and get both outcomes: .net works fine, and we can turn single-statement optimizations back on.
This is in spec even though it won't execute both .net statements at the same timestamp.
I'm going to go with @mjibson's approach of eagerly committing `SELECT`s that are part of an implicit transaction in the extended protocol.
By the way, for the record, the start-up message flow of the C# PostgreSQL driver looks like this:
Parse { name: "", sql: "\r\nSELECT version()", param_types: [] }
Bind { portal_name: "", statement_name: "", param_formats: [], raw_params: [], result_formats: [Text] }
DescribePortal { name: "" }
Execute { portal_name: "", max_rows: 0 }
Parse { name: "", sql: "SELECT ns.nspname, typ_and_elem_type.*,\r\n CASE\r\n WHEN typtype IN ('b', 'e', 'p') THEN 0 -- First base types, enums, pseudo-types\r\n WHEN typtype = 'r' THEN 1 -- Ranges after\r\n WHEN typtype = 'c' THEN 2 -- Composites after\r\n WHEN typtype = 'd' AND elemtyptype <> 'a' THEN 3 -- Domains over non-arrays after\r\n WHEN typtype = 'a' THEN 4 -- Arrays before\r\n WHEN typtype = 'd' AND elemtyptype = 'a' THEN 5 -- Domains over arrays last\r\n END AS ord\r\nFROM (\r\n -- Arrays have typtype=b - this subquery identifies them by their typreceive and converts their typtype to a\r\n -- We first do this for the type (innerest-most subquery), and then for its element type\r\n -- This also returns the array element, range subtype and domain base type as elemtypoid\r\n SELECT\r\n typ.oid, typ.typnamespace, typ.typname, typ.typtype, typ.typrelid, typ.typnotnull, typ.relkind,\r\n elemtyp.oid AS elemtypoid, elemtyp.typname AS elemtypname, elemcls.relkind AS elemrelkind,\r\n CASE WHEN elemproc.proname='array_recv' THEN 'a' ELSE elemtyp.typtype END AS elemtyptype\r\n FROM (\r\n SELECT typ.oid, typnamespace, typname, typrelid, typnotnull, relkind, typelem AS elemoid,\r\n CASE WHEN proc.proname='array_recv' THEN 'a' ELSE typ.typtype END AS typtype,\r\n CASE\r\n WHEN proc.proname='array_recv' THEN typ.typelem\r\n WHEN typ.typtype='r' THEN rngsubtype\r\n WHEN typ.typtype='d' THEN typ.typbasetype\r\n END AS elemtypoid\r\n FROM pg_type AS typ\r\n LEFT JOIN pg_class AS cls ON (cls.oid = typ.typrelid)\r\n LEFT JOIN pg_proc AS proc ON proc.oid = typ.typreceive\r\n LEFT JOIN pg_range ON (pg_range.rngtypid = typ.oid)\r\n ) AS typ\r\n LEFT JOIN pg_type AS elemtyp ON elemtyp.oid = elemtypoid\r\n LEFT JOIN pg_class AS elemcls ON (elemcls.oid = elemtyp.typrelid)\r\n LEFT JOIN pg_proc AS elemproc ON elemproc.oid = elemtyp.typreceive\r\n) AS typ_and_elem_type\r\nJOIN pg_namespace AS ns ON (ns.oid = typnamespace)\r\nWHERE\r\n typtype IN ('b', 'r', 'e', 'd') OR -- Base, 
range, enum, domain\r\n (typtype = 'c' AND relkind='c') OR -- User-defined free-standing composites (not table composites) by default\r\n (typtype = 'p' AND typname IN ('record', 'void')) OR -- Some special supported pseudo-types\r\n (typtype = 'a' AND ( -- Array of...\r\n elemtyptype IN ('b', 'r', 'e', 'd') OR -- Array of base, range, enum, domain\r\n (elemtyptype = 'p' AND elemtypname IN ('record', 'void')) OR -- Arrays of special supported pseudo-types\r\n (elemtyptype = 'c' AND elemrelkind='c') -- Array of user-defined free-standing composites (not table composites) by default\r\n ))\r\nORDER BY ord", param_types: [] }
Bind { portal_name: "", statement_name: "", param_formats: [], raw_params: [], result_formats: [Text] }
DescribePortal { name: "" }
Execute { portal_name: "", max_rows: 0 }
Parse { name: "", sql: "-- Load field definitions for (free-standing) composite types\r\nSELECT typ.oid, att.attname, att.atttypid\r\nFROM pg_type AS typ\r\nJOIN pg_namespace AS ns ON (ns.oid = typ.typnamespace)\r\nJOIN pg_class AS cls ON (cls.oid = typ.typrelid)\r\nJOIN pg_attribute AS att ON (att.attrelid = typ.typrelid)\r\nWHERE\r\n (typ.typtype = 'c' AND cls.relkind='c') AND\r\n attnum > 0 AND -- Don't load system attributes\r\n NOT attisdropped\r\nORDER BY typ.oid, att.attnum", param_types: [] }
Bind { portal_name: "", statement_name: "", param_formats: [], raw_params: [], result_formats: [Text] }
DescribePortal { name: "" }
Execute { portal_name: "", max_rows: 0 }
Parse { name: "", sql: "-- Load enum fields\r\nSELECT pg_type.oid, enumlabel\r\nFROM pg_enum\r\nJOIN pg_type ON pg_type.oid=enumtypid\r\nORDER BY oid, enumsortorder", param_types: [] }
Bind { portal_name: "", statement_name: "", param_formats: [], raw_params: [], result_formats: [Text] }
DescribePortal { name: "" }
Execute { portal_name: "", max_rows: 0 }
Sync
Do you need to eagerly commit all statements? What happens if there are some INSERTs or UPDATEs, for example?
I don't have this working with Materialize yet. I just made some hacky changes and added a `println!` so I could see the messages.
In order for this to work consistently, we would need to eagerly commit all the `SELECT` statements.
With the current proposal, we would allow multiple `INSERT`s in a single implicit transaction, without eagerly committing them. I think this is actually probably fine.
For now, we should disallow implicit transactions that mix `SELECT`s and `INSERT`s. We should also disallow implicit transactions with `UPDATE` (or any other read-write statement) that contain multiple statements.
I guess what you're saying, though, is that it might be a better idea to eagerly commit `INSERT`s, `UPDATE`s, etc., instead of waiting for another statement and aborting?
The final resolution was to eagerly commit ALL statements.