crate-python icon indicating copy to clipboard operation
crate-python copied to clipboard

SQLAlchemy: Polyfill for `AUTOINCREMENT` columns

Open amotl opened this issue 2 years ago • 10 comments

About

CrateDB does not support the notion of autoincrement columns, because it is a distributed database. For supporting certain applications, specific workarounds have been showing a successful outcome. Hereby, we would like to evaluate how a corresponding patch could be generalized, to be optionally enabled on the CrateDB SQLAlchemy dialect.

Reference

  • https://github.com/crate/crate/issues/11020

Details

CrateDB can generate unique values server-side by using the GEN_RANDOM_TEXT_UUID() function, which is applicable for TEXT types.

"columnname" TEXT DEFAULT GEN_RANDOM_TEXT_UUID() NOT NULL

While working on mlflow-cratedb, we had the need to create unique Integer-based values, so we added a corresponding monkeypatch, which also has a remark:

TODO: Submit patch to crate-python, to be enabled by a dialect parameter crate_polyfill_autoincrement or such.

Proposal

So, the idea here is to add a dialect parameter crate_polyfill_autoincrement, which will, under the hood, transparently augment SQLAlchemy models to swap in a different procedure, depending on its column type. For text-based columns, the server-side DEFAULT GEN_RANDOM_TEXT_UUID() may be applicable, while for integer-based columns, the patching would need to resort to a timestamp [^1][^2].

Followup

If that turns out well, make sure to report the outcome back to the original ticket https://github.com/crate/crate/issues/11020, and also write a "best practice" community post about it.

[^1]: I don't know yet how or whether the timestamp resolution should be configurable at all. It would probably be best to always use nanosecond resolution, in order to reduce the chance of collisions in high-traffic/-volume/-performance environments. If that is not an issue, millisecond resolution might be enough, but making it configurable would probably be a mess. [^2]: For the column to be able to store large integer numbers like Unix time since Epoch in nanoseconds, it would need to be converged into a BIGINT, if it was defined as an INT only.

amotl avatar Sep 14 '23 13:09 amotl

Another option for INT auto-increments which could be useful in some cases, as something that a developer would need to consciously enable, could be for the driver to generate the values working on the knowledge that it is the only client adding rows. So it could get current max at the first request and then manage a sequence from then on.

hlcianfagna avatar Sep 14 '23 13:09 hlcianfagna

Hi Hernan. It always sounds nice to gain "kind of autoincrement" features, but, at the same time, always falls short, as there are too many ambiguities and spots where things can go wrong. In this manner, the disadvantages outweigh the advantages. Just to enumerate a few examples:

  • Even if the driver would know it is the only one, it would not know if it is used concurrently.
  • The driver would need to keep sequence state per connection, and can't communicate that state to other connections, neither in-process nor ex-process.
  • If an INSERT goes wrong after generating an identifier, what to do?

This issue was primarily raised to track carrying over the code which has been conceived at mlflow-cratedb, which is validated by a real-world SQLAlchemy application and corresponding test cases now, so I think it is valuable to think about generalizing it.

I am not too fancy to think about different solutions which are even constrained so much, and offer plenty of chances to use wrong. As we are handling primary keys at this spot, the "what could possibly go wrong" aspect is straight on the table here.

However, when using client-side identifier generation anyway, I am open for any suggestions whether to offer different generation algorithms, if that would not complicate the implementation too much, and would satisfy the uniqueness constraints.

amotl avatar Sep 14 '23 13:09 amotl

Following up on GH-580 by @surister (thanks!), we may think about accompanying this with an alternative, yet different, to emulate autoincrement column constraints based on integer values. This is what we are currently using to compensate one spot at https://github.com/crate-workbench/langchain/pull/1.

id = sa.Column(sa.BigInteger, primary_key=True, default=generate_autoincrement_identifier)
import datetime as dt

def generate_autoincrement_identifier() -> int:
    """
    Generate auto-incrementing integer values,
    fitting into the BIGINT data type for a while.

    Based on Nagamani23, it returns the number of
    microseconds elapsed since the beginning of 2023.
    
    The property that the value is increasing, to resemble
    classical autoincrement behavior, is important for the
    corresponding use cases.
    """
    naga23_delta = dt.datetime.utcnow() - dt.datetime(2023, 1, 1)
    naga23_microseconds = int(naga23_delta.total_seconds() * 1_000_000)
    return naga23_microseconds

amotl avatar Sep 17 '23 18:09 amotl

id = sa.Column(sa.BigInteger, primary_key=True, default=generate_autoincrement_identifier)

Why not use now() server_default instead? This would at least use server logic (use server's date/time env instead of possible different clients one). Also the microsecond calculation looks flawn to me as the result is only kinda unique in the seconds range. The server's now() function would be in milisecond range. (https://crate.io/docs/crate/reference/en/5.4/general/builtins/scalar-functions.html#now)

id = sa.Column(sa.BigInteger, primary_key=True, default="now()")

seut avatar Sep 18 '23 09:09 seut

a) It apparently does not work out of the box because NOW() seems to return a string type. Is there a way to the cast the value transparently, or use a different function?

id = sa.Column(sa.BigInteger, primary_key=True, server_default="NOW()")
SQLParseException[Cannot cast `'NOW()'` of type `text` to type `bigint`]

b) Microsecond resolution works reasonably well. Millisecond resolution might be too low in high performance or high concurrency situtations, which a typical test suite will produce on its own, or simulate on purpose.

>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500672.265006
>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500674.232871

amotl avatar Sep 18 '23 10:09 amotl

a) It apparently does not work out of the box because NOW() seems to return a string type. Is there a way to the cast the value transparently, or use a different function?

id = sa.Column(sa.BigInteger, primary_key=True, server_default="NOW()")
SQLParseException[Cannot cast `'NOW()'` of type `text` to type `bigint`]

b) Microsecond resolution works reasonably well. Millisecond resolution might be too low in high performance or high concurrency situtations, which a typical test suite will produce on its own, or simulate on purpose.

>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500672.265006
>>> (dt.datetime.utcnow() - dt.datetime(2023, 1, 1)).total_seconds()
22500674.232871

How about:

https://docs.python.org/3/library/time.html#time.perf_counter

>>> time.perf_counter_ms()
9720976934550

surister avatar Sep 18 '23 10:09 surister

Thanks for the suggestion, but it has a drawback:

The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid.

amotl avatar Sep 18 '23 10:09 amotl

NOW() seems to return a string type. Is there a way to the cast the value transparently

now()::BIGINT

or

(extract(EPOCH FROM now()) * 1000)

?

hlcianfagna avatar Sep 18 '23 10:09 hlcianfagna

With SQL DDL, it works well, even without a cast.

create table testdrive (foo bigint default now(), bar text);
insert into testdrive (bar) values ('baz');
cr> select * from testdrive;
+---------------+-----+
|           foo | bar |
+---------------+-----+
| 1695034699106 | baz |
+---------------+-----+
SELECT 1 row in set (0.881 sec)

And when using this SQLAlchemy definition (which I got wrong before), it also makes a corresponding LangChain test case succeed.

sa.Column(sa.BigInteger, primary_key=True, server_default=sa.func.now())

Excellent, I will change the corresponding code, to check if it also works well on MLflow, and will also add relevant documentation improvements.

amotl avatar Sep 18 '23 11:09 amotl

@seut: Your suggestion worked well to satisfy both test suites, see ^1. Thank you very much.

amotl avatar Sep 18 '23 11:09 amotl