Unexpected changes to ID values during object creation in sqlite
I am observing a situation where a BigInt primary key that is not defined to auto-increment ends up saved with a different value than the one I pass in.
The issue may be SQLite-specific; I have not tried this with Postgres yet.
Schema
The trends field should not matter here; I am just including it for completeness to show how I am currently set up:
from piccolo.table import Table
from piccolo.columns.column_types import BigInt, JSON
from piccolo.columns.m2m import M2M
from piccolo.columns.reference import LazyTableReference

class Tweet(Table, tablename="tweets"):
    tweet_id: int = BigInt(primary_key=True)
    data: dict = JSON()
    trends = M2M(LazyTableReference("TrendTweet", module_path=__name__))
Migration
async def forwards():
    manager = MigrationManager(
        migration_id=ID,
        app_name="myapp",
        description=DESCRIPTION,
    )
    manager.add_table("Tweet", tablename="tweets")
    manager.add_column(
        table_class_name="Tweet",
        tablename="tweets",
        column_name="tweet_id",
        db_column_name="tweet_id",
        column_class_name="BigInt",
        column_class=BigInt,
        params={
            "default": None,
            "null": False,
            "primary_key": True,
            "unique": True,
            "index": True,
            "index_method": IndexMethod.btree,
            "choices": None,
            "db_column_name": None,
            "secret": False,
        },
    )
    manager.add_column(
        table_class_name="Tweet",
        tablename="tweets",
        column_name="data",
        db_column_name="data",
        column_class_name="JSON",
        column_class=JSON,
        params={
            "default": "{}",
            "null": False,
            "primary_key": False,
            "unique": False,
            "index": False,
            "index_method": IndexMethod.btree,
            "choices": None,
            "db_column_name": None,
            "secret": False,
        },
    )
The resulting SQLite schema
This looks correct to me:
CREATE TABLE tweets ("tweet_id" INTEGER PRIMARY KEY UNIQUE NOT NULL DEFAULT null, "data" JSON NOT NULL DEFAULT '{}');
CREATE INDEX tweets_tweet_id ON tweets ("tweet_id");
Example code snippet
for _tweet in tweets:  # tweets is a list of response objects from the Twitter API
    print("GET OR CREATE:", _tweet.id)
    tweet = Tweet.objects().get_or_create(
        Tweet.tweet_id == _tweet.id,
        defaults={"data": _tweet},
    ).run_sync()
    print("TWEET OBJ ID ", tweet.tweet_id)
    print("ORIG TWEET ID", _tweet["id"])
Output showing the problem
GET OR CREATE: 1587206613131317250
TWEET OBJ ID 1587206613131317248
ORIG TWEET ID 1587206613131317250
The issue does not seem to be specific to get_or_create. I have also tried create:
tweet = Tweet.objects().where(Tweet.tweet_id == _tweet.id).first().run_sync()
if tweet is None:
    tweet = Tweet.objects().create(tweet_id=_tweet.id, data=_tweet).run_sync()
and with save ...
tweet = Tweet.objects().where(Tweet.tweet_id == _tweet.id).first().run_sync()
if tweet is None:
    tweet = Tweet(tweet_id=_tweet.id, data=_tweet)
    tweet.save().run_sync()
I see the same problem with these other approaches.
A particularly odd thing is that it is not consistent. Sometimes my loop runs correctly for as many as 3 or 4 tweets before a mismatch appears, meaning the resulting ID is correct in those cases. The tweet_id on the created object seems to be anywhere from n+0 to n+3, where n is the original ID.
I am very confused by this. I know the above does not really isolate the problem in a replicable way, but maybe you have a theory about what might be happening?
As a workaround, I have resorted to a raw insert query as follows:
tweet = Tweet.objects().where(Tweet.tweet_id == _tweet.id).first().run_sync()
if tweet is None:
    q = Tweet.raw(
        "insert into tweets (tweet_id, data) values ({}, {})",
        _tweet.id, json.dumps(_tweet)
    ).run_sync()
This works, which makes me think the problem is somewhere in Piccolo's query formation. I've tried getting the queries for the code described above, but I'm not sure how to do that for a create statement: attempting to print the query gives a Create class representation, not the raw SQL.
@scott2b It sounds like there might be a bug here:
https://github.com/piccolo-orm/piccolo/blob/fbf298f763432859c9682a8197608e633e3326dd/piccolo/engine/sqlite.py#L457-L468
We use that workaround because older versions of SQLite don't support the RETURNING clause.
I would need to investigate this further.
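For context, the usual workaround when RETURNING isn't available is to run the INSERT and then read the inserted row back using the cursor's lastrowid. A minimal sketch of that general pattern with the stdlib sqlite3 module (illustrative only, not Piccolo's actual code):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, data TEXT)")

cursor = conn.execute(
    "INSERT INTO tweets (tweet_id, data) VALUES (?, ?)",
    (1587206613131317250, '{"message": "hello"}'),
)
# For an INTEGER PRIMARY KEY column, the column aliases the rowid,
# so lastrowid can be used to read the inserted row back.
row = conn.execute(
    "SELECT tweet_id FROM tweets WHERE rowid = ?", (cursor.lastrowid,)
).fetchone()
print(row[0])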
I put together a test script:
from piccolo.table import Table
from piccolo.columns.column_types import BigInt, JSON
from piccolo.engine.sqlite import SQLiteEngine

DB = SQLiteEngine()

class Tweet(Table, tablename="tweets", db=DB):
    tweet_id: int = BigInt(primary_key=True)
    data: dict = JSON()

async def main():
    if await Tweet.table_exists():
        await Tweet.alter().drop_table()
    await Tweet.create_table()

    for i in range(1, 100):
        tweet = await Tweet.objects().create(
            tweet_id=i, data={'message': 'hello world'}
        )
        assert tweet.tweet_id == i

    print(await Tweet.select())

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
This works, so at least we know the create method works. It might be something isolated to get_or_create - I will check that next.
OK, I think I know what's going on now. I updated the example script again:
from piccolo.table import Table
from piccolo.columns.column_types import Integer, JSON
from piccolo.engine.sqlite import SQLiteEngine

DB = SQLiteEngine()

class Tweet(Table, tablename="tweets", db=DB):
    tweet_id: int = Integer(primary_key=True)
    data: dict = JSON()

async def main():
    if await Tweet.table_exists():
        await Tweet.alter().drop_table()
    await Tweet.create_table()

    start = 1000000000

    # Make sure `create` assigns primary key values correctly.
    for i in range(start, start + 100):
        tweet = await Tweet.objects().create(
            tweet_id=i,
            data={'message': 'hello world'}
        )
        assert tweet.tweet_id == i

    # Make sure `get_or_create` assigns primary key values correctly.
    for i in range(start, start + 100):
        tweet = await Tweet.objects().get_or_create(
            Tweet.tweet_id == i,
            defaults={'message': 'hello world 2'}
        )
        assert tweet.tweet_id == i

    print(await Tweet.select())

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
It works fine with get_or_create too. However, I think the problem is when the numbers are really large. In my example script, if I change start to 1587206613131317250 it fails. I think the number is too large for SQLite, and it starts behaving strangely.
You might want to store the value in a Varchar instead.
This article explains things: http://jakegoulding.com/blog/2011/02/06/sqlite-64-bit-integers/
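For what it's worth, the mismatched value in the output above is exactly what you get if the integer makes a round trip through an IEEE-754 double: doubles only carry 53 bits of mantissa, and these tweet IDs are well above 2**53 (though still comfortably inside SQLite's signed 64-bit INTEGER range). A quick check in plain Python, with no Piccolo or SQLite involved:

n = 1587206613131317250
print(n < 2**63 - 1)  # True: fits in SQLite's signed 64-bit INTEGER
print(n > 2**53)      # True: beyond the range where a double is exact
print(int(float(n)))  # 1587206613131317248, the value from the output above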
Huh. Interesting. I don't have problems with large numbers (specifically tweet IDs) if I insert via raw queries. But I will keep this in mind. Thank you for looking into it.
> I don't have problems with large numbers (specifically tweet IDs) if I insert via raw queries

Hmm, that is interesting.
The test script is handy, thanks. I've made a couple of minor tweaks. I am wondering if there is a way I can get piccolo to show me the queries it is producing? Maybe a logging level, or some way to just explicitly print the SQL for a create statement?
You can do this, and it prints out the SQL:
DB = SQLiteEngine(log_queries=True)
Oh, perfect. I have updated the gist with my observations. The raw insert fails when it includes a RETURNING clause (added to match the way the create statement works):
r = await Tweet.raw(
    "insert into tweets (tweet_id, data) values ({}, {}) returning tweet_id",
    i, '{"message": "hello world 1"}'
)
tweet_id = r[0]["tweet_id"]
assert tweet_id == i, f"ID from raw insert call with return not correct. {tweet_id} != {i}"
e.g.:
AssertionError: ID from raw insert call with return not correct. 1587623774714830848 != 1587623774714830850
sqlite> select * from tweets;
1587623774714830850|{"message": "hello world 1"}
But, also interesting: within the sqlite console, I get the correct return value for this same query:
sqlite> insert into tweets (tweet_id, data) values (1587623774714830850, '{"message": "hello world 1"}') returning tweet_id;
1587623774714830850
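One way to narrow this down further would be to run the same INSERT ... RETURNING through the stdlib sqlite3 module directly, cutting both Piccolo and aiosqlite out of the loop. A sketch, using the values from the output above (this needs the linked SQLite library to be 3.35+ for RETURNING support):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, data TEXT)")
row = conn.execute(
    "INSERT INTO tweets (tweet_id, data) VALUES (?, ?) RETURNING tweet_id",
    (1587623774714830850, '{"message": "hello world 1"}'),
).fetchone()
print(row[0])  # 1587623774714830850 if the driver round-trips faithfully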
fwiw, I am now seeing what appears to be a related issue when using is_in clauses. I am no longer using tweet IDs as a primary key, but I do still have the tweet_id field. I am doing something like this:
existing = [
    t.tweet_id
    for t in Tweet.objects().where(
        (Tweet.trend == trend) & (Tweet.tweet_id.is_in(ids))
    ).run_sync()
]
for e in existing:
    assert e in ids, f"Existing tweet ID {e} is not in IDs queried"
For some lists of IDs, I am seeing the assertion error.
I think that the solution for me, for now, is to move to a Varchar field as recommended above. This does avoid the problems of the int field -- it just means I have to ~~convert IDs from the Twitter API into strings~~ use the id_str property from the Twitter API rather than the id.
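For reference, a sketch of what that Varchar-based table might look like (the length of 30 is my assumption; Twitter's id_str values are currently 19 digits):

from piccolo.table import Table
from piccolo.columns.column_types import Varchar, JSON

class Tweet(Table, tablename="tweets"):
    # Stores the Twitter API's id_str value rather than the numeric id.
    tweet_id = Varchar(length=30, primary_key=True)
    data = JSON()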
i think i might be running into the same (or at least a similar) issue:
root@78842219af3b:/# sqlite3 /root/.local/share/casper7/meatball.db
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .schema meatball_role
CREATE TABLE meatball_role (
id INTEGER PRIMARY KEY,
guild_id INTEGER NOT NULL,
role_id INTEGER NOT NULL
);
CREATE UNIQUE INDEX idx_unique_guild_role_mr ON meatball_role (guild_id, role_id);
sqlite> select * from meatball_role;
id guild_id role_id
-- ------------------ ------------------
1 247116978509053953 814435003437023242
2 740493455137964043 816711809964572682
sqlite>
root@78842219af3b:/# ipython
Python 3.10.8 (main, Oct 26 2022, 03:28:14) [GCC 10.2.1 20210110]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from casper7_plugin_meatball_day.tables import MeatballRole
In [2]: MeatballRole.select().run_sync()
Out[2]:
[{'id': 1, 'guild_id': 247116978509053952, 'role_id': 814435003437023232},
{'id': 2, 'guild_id': 740493455137964032, 'role_id': 816711809964572672}]
... redacted for brevity ...
In [5]: MeatballRole._meta._db.connection_kwargs
Out[5]:
{'database': '/root/.local/share/casper7/meatball.db',
'detect_types': 3,
'isolation_level': None}
In [6]: MeatballRole.select().where(MeatballRole.guild_id == 247116978509053953).run_sync()
Out[6]: [{'id': 1, 'guild_id': 247116978509053952, 'role_id': 814435003437023232}]
In [7]: MeatballRole.select().where(MeatballRole.guild_id == 740493455137964043).run_sync()
Out[7]: [{'id': 2, 'guild_id': 740493455137964032, 'role_id': 816711809964572672}]
in short, the results i get back via piccolo are different to those i get back from sqlite. i can query by the correct guild_id and get back the correct record, but the guild_id in the result is different.
i will try migrating to a varchar column as well.
update: changing my schema to use Text columns instead of Integer for the IDs did fix the issue.
@backwardspy It's a strange one. I need to write some tests to figure out what's going on.
@dantownsend yeah it took me a while to narrow it down to that example! please let me know if there's anything i can do to help.
@backwardspy At the moment I'm just trying to narrow down what the problem could be.
- SQLite max integer size - there is a limit, but it's higher than the numbers you're trying to save.
- Python 3 - integers are arbitrary precision, so an integer size limit is unlikely to be the problem.
- Is it a problem with an intermediate library like aiosqlite, or Piccolo itself?
Such a head scratcher! If you have any ideas let me know.
i've grabbed an offline copy of the database to play with later, i'll probably run the whole lot through a debugger and see if i can spot where it goes wrong.
@backwardspy Cool, thanks
i just spent some time following execution through piccolo and aiosqlite on a small repro project. it seems to get all the way down to the stdlib sqlite3 module before the IDs get broken.
i then tried the same operations just using the sqlite3 module directly but, perhaps unsurprisingly, i was unable to reproduce the issue.
i think next i'll see if i can repro with just aiosqlite, no piccolo, and try to narrow the issue down further.
so far so puzzling!
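for reference, a minimal aiosqlite-only check (no piccolo) along those lines might look something like this (just an illustrative sketch, using the guild_id value from earlier):

import asyncio
import aiosqlite

async def main():
    async with aiosqlite.connect(":memory:") as db:
        await db.execute("CREATE TABLE t (guild_id INTEGER NOT NULL)")
        await db.execute(
            "INSERT INTO t (guild_id) VALUES (?)", (247116978509053953,)
        )
        async with db.execute("SELECT guild_id FROM t") as cursor:
            row = await cursor.fetchone()
            print(row[0])  # expect 247116978509053953

asyncio.run(main())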
@backwardspy Thanks for investigating it 👍
Hopefully we will uncover this mystery!