Unexpected changes to ID values during object creation in sqlite
I am observing a situation where a BigInt primary key that is not defined to auto-increment ends up saved with a different value than the one I pass in.
The issue may be SQLite-specific; I have not tried this with Postgres yet.
Schema
The trends field should not matter here; I am just including it for completeness to show how I am currently set up:
from piccolo.table import Table
from piccolo.columns.column_types import BigInt, JSON
from piccolo.columns.m2m import M2M
from piccolo.columns.reference import LazyTableReference

class Tweet(Table, tablename="tweets"):
    tweet_id: int = BigInt(primary_key=True)
    data: dict = JSON()
    trends = M2M(LazyTableReference("TrendTweet", module_path=__name__))
Migration
async def forwards():
    manager = MigrationManager(
        migration_id=ID,
        app_name="myapp",
        description=DESCRIPTION,
    )
    manager.add_table("Tweet", tablename="tweets")
    manager.add_column(
        table_class_name="Tweet",
        tablename="tweets",
        column_name="tweet_id",
        db_column_name="tweet_id",
        column_class_name="BigInt",
        column_class=BigInt,
        params={
            "default": None,
            "null": False,
            "primary_key": True,
            "unique": True,
            "index": True,
            "index_method": IndexMethod.btree,
            "choices": None,
            "db_column_name": None,
            "secret": False,
        },
    )
    manager.add_column(
        table_class_name="Tweet",
        tablename="tweets",
        column_name="data",
        db_column_name="data",
        column_class_name="JSON",
        column_class=JSON,
        params={
            "default": "{}",
            "null": False,
            "primary_key": False,
            "unique": False,
            "index": False,
            "index_method": IndexMethod.btree,
            "choices": None,
            "db_column_name": None,
            "secret": False,
        },
    )
The resulting SQLite schema
This looks correct to me:
CREATE TABLE tweets ("tweet_id" INTEGER PRIMARY KEY UNIQUE NOT NULL DEFAULT null, "data" JSON NOT NULL DEFAULT '{}');
CREATE INDEX tweets_tweet_id ON tweets ("tweet_id");
Example code snippet
for _tweet in tweets:  # tweets is a list of response objects from the Twitter API
    print("GET OR CREATE:", _tweet.id)
    tweet = Tweet.objects().get_or_create(
        Tweet.tweet_id == _tweet.id,
        defaults={"data": _tweet},
    ).run_sync()
    print("TWEET OBJ ID ", tweet.tweet_id)
    print("ORIG TWEET ID", _tweet["id"])
Output showing the problem
GET OR CREATE: 1587206613131317250
TWEET OBJ ID 1587206613131317248
ORIG TWEET ID 1587206613131317250
The issue does not seem to be specific to get_or_create. I have also tried create:
tweet = Tweet.objects().where(Tweet.tweet_id == _tweet.id).first().run_sync()
if tweet is None:
    tweet = Tweet.objects().create(tweet_id=_tweet.id, data=_tweet).run_sync()
and with save ...
tweet = Tweet.objects().where(Tweet.tweet_id == _tweet.id).first().run_sync()
if tweet is None:
    tweet = Tweet(tweet_id=_tweet.id, data=_tweet)
    tweet.save().run_sync()
I see the same problem with these other approaches.
A particularly odd thing is that it is not consistent. Sometimes my loop runs correctly for as many as 3 or 4 tweets before a mismatch appears, meaning the resulting ID is correct in those cases. The tweet_id on the created object seems to be anywhere from n+0 to n+3, where n is the original ID.
I am very confused by this. I know the above does not really isolate the problem in a replicable way, but maybe you have a theory about what might be happening?
As a workaround, I have resorted to a raw insert query as follows:
tweet = Tweet.objects().where(Tweet.tweet_id == _tweet.id).first().run_sync()
if tweet is None:
    q = Tweet.raw(
        "insert into tweets (tweet_id, data) values ({}, {})",
        _tweet.id, json.dumps(_tweet)
    ).run_sync()
This works, which makes me think the problem is somewhere in Piccolo's query formation. I've tried getting the queries for the code described above, but I'm not sure how to do that for a create statement: attempting to print the query gives a Create class representation, not the raw SQL.
@scott2b It sounds like there might be a bug here:
https://github.com/piccolo-orm/piccolo/blob/fbf298f763432859c9682a8197608e633e3326dd/piccolo/engine/sqlite.py#L457-L468
We use that workaround because older versions of SQLite don't support the RETURNING clause.
I would need to investigate this further.
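For context, the usual workaround when RETURNING isn't available is to run the INSERT and then read the inserted row back using the cursor's lastrowid. A minimal sketch of that general pattern with the stdlib sqlite3 module (illustrative only, not Piccolo's actual code):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, data TEXT)")

cursor = conn.execute(
    "INSERT INTO tweets (tweet_id, data) VALUES (?, ?)",
    (1587206613131317250, '{"message": "hello"}'),
)
# For an INTEGER PRIMARY KEY column, the column aliases the rowid,
# so lastrowid can be used to read the inserted row back.
row = conn.execute(
    "SELECT tweet_id FROM tweets WHERE rowid = ?", (cursor.lastrowid,)
).fetchone()
print(row[0])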
I put together a test script:
from piccolo.table import Table
from piccolo.columns.column_types import BigInt, JSON
from piccolo.engine.sqlite import SQLiteEngine

DB = SQLiteEngine()

class Tweet(Table, tablename="tweets", db=DB):
    tweet_id: int = BigInt(primary_key=True)
    data: dict = JSON()

async def main():
    if await Tweet.table_exists():
        await Tweet.alter().drop_table()
    await Tweet.create_table()

    for i in range(1, 100):
        tweet = await Tweet.objects().create(
            tweet_id=i, data={'message': 'hello world'}
        )
        assert tweet.tweet_id == i

    print(await Tweet.select())

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
This works, so at least we know the create method works. It might be something isolated to get_or_create - I will check that next.
OK, I think I know what's going on now. I updated the example script again:
from piccolo.table import Table
from piccolo.columns.column_types import Integer, JSON
from piccolo.engine.sqlite import SQLiteEngine

DB = SQLiteEngine()

class Tweet(Table, tablename="tweets", db=DB):
    tweet_id: int = Integer(primary_key=True)
    data: dict = JSON()

async def main():
    if await Tweet.table_exists():
        await Tweet.alter().drop_table()
    await Tweet.create_table()

    start = 1000000000

    # Make sure `create` assigns primary key values correctly.
    for i in range(start, start + 100):
        tweet = await Tweet.objects().create(
            tweet_id=i,
            data={'message': 'hello world'}
        )
        assert tweet.tweet_id == i

    # Make sure `get_or_create` assigns primary key values correctly.
    for i in range(start, start + 100):
        tweet = await Tweet.objects().get_or_create(
            Tweet.tweet_id == i,
            defaults={'message': 'hello world 2'}
        )
        assert tweet.tweet_id == i

    print(await Tweet.select())

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
It works fine with get_or_create too. However, I think the problem is when the numbers are really large. In my example script, if I change start to 1587206613131317250 it fails. I think the number is too large for SQLite, and it starts behaving strangely.
You might want to store the value in a Varchar instead.
This article explains things: http://jakegoulding.com/blog/2011/02/06/sqlite-64-bit-integers/
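For what it's worth, the mismatched value in the output above is exactly what you get if the integer makes a round trip through an IEEE-754 double: doubles only carry 53 bits of mantissa, and these tweet IDs are well above 2**53 (though still comfortably inside SQLite's signed 64-bit INTEGER range). A quick check in plain Python, with no Piccolo or SQLite involved:

n = 1587206613131317250
print(n < 2**63 - 1)  # True: fits in SQLite's signed 64-bit INTEGER
print(n > 2**53)      # True: beyond the range where a double is exact
print(int(float(n)))  # 1587206613131317248, the value from the output above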
Huh. Interesting. I don't have problems with large numbers (specifically tweet IDs) if I insert via raw queries. But I will keep this in mind. Thank you for looking into it.
> I don't have problems with large numbers (specifically tweet IDs) if I insert via raw queries

Hmm, that is interesting.
The test script is handy, thanks. I've made a couple of minor tweaks. I am wondering if there is a way I can get piccolo to show me the queries it is producing? Maybe a logging level, or some way to just explicitly print the SQL for a create statement?
You can do this, and it prints out the SQL:
DB = SQLiteEngine(log_queries=True)
Oh, perfect. I have updated the gist with my observations. The raw insert fails when it includes a RETURNING clause (added to match the way the create statement works):
r = await Tweet.raw(
    "insert into tweets (tweet_id, data) values ({}, {}) returning tweet_id",
    i, '{"message": "hello world 1"}'
)
tweet_id = r[0]["tweet_id"]
assert tweet_id == i, f"ID from raw insert call with return not correct. {tweet_id} != {i}"
e.g.:
AssertionError: ID from raw insert call with return not correct. 1587623774714830848 != 1587623774714830850
sqlite> select * from tweets;
1587623774714830850|{"message": "hello world 1"}
But, also interesting: within the sqlite console, I get the correct return value for this same query:
sqlite> insert into tweets (tweet_id, data) values (1587623774714830850, '{"message": "hello world 1"}') returning tweet_id;
1587623774714830850
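One way to narrow this down further would be to run the same INSERT ... RETURNING through the stdlib sqlite3 module directly, cutting both Piccolo and aiosqlite out of the loop. A sketch, using the values from the output above (this needs the linked SQLite library to be 3.35+ for RETURNING support):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, data TEXT)")
row = conn.execute(
    "INSERT INTO tweets (tweet_id, data) VALUES (?, ?) RETURNING tweet_id",
    (1587623774714830850, '{"message": "hello world 1"}'),
).fetchone()
print(row[0])  # 1587623774714830850 if the driver round-trips faithfully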
fwiw, I am now seeing what appears to be a related issue when using is_in clauses. I am no longer using tweet IDs as a primary key, but I do still have the tweet_id field. I am doing something like this:
existing = [
    t.tweet_id
    for t in Tweet.objects().where(
        (Tweet.trend == trend) & (Tweet.tweet_id.is_in(ids))
    ).run_sync()
]
for e in existing:
    assert e in ids, f"Existing tweet ID {e} is not in IDs queried"
For some lists of IDs, I am seeing the assertion error.
I think that the solution for me, for now, is to move to a Varchar field as recommended above. This does avoid the problems of the int field -- it just means I have to ~~convert IDs from the Twitter API into strings~~ use the id_str property from the Twitter API rather than the id.
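For reference, a sketch of what that Varchar-based table might look like (the length of 30 is my assumption; Twitter's id_str values are currently 19 digits):

from piccolo.table import Table
from piccolo.columns.column_types import Varchar, JSON

class Tweet(Table, tablename="tweets"):
    # Stores the Twitter API's id_str value rather than the numeric id.
    tweet_id = Varchar(length=30, primary_key=True)
    data = JSON()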
i think i might be running into the same (or at least a similar) issue:
root@78842219af3b:/# sqlite3 /root/.local/share/casper7/meatball.db
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .schema meatball_role
CREATE TABLE meatball_role (
id INTEGER PRIMARY KEY,
guild_id INTEGER NOT NULL,
role_id INTEGER NOT NULL
);
CREATE UNIQUE INDEX idx_unique_guild_role_mr ON meatball_role (guild_id, role_id);
sqlite> select * from meatball_role;
id guild_id role_id
-- ------------------ ------------------
1 247116978509053953 814435003437023242
2 740493455137964043 816711809964572682
sqlite>
root@78842219af3b:/# ipython
Python 3.10.8 (main, Oct 26 2022, 03:28:14) [GCC 10.2.1 20210110]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from casper7_plugin_meatball_day.tables import MeatballRole
In [2]: MeatballRole.select().run_sync()
Out[2]:
[{'id': 1, 'guild_id': 247116978509053952, 'role_id': 814435003437023232},
{'id': 2, 'guild_id': 740493455137964032, 'role_id': 816711809964572672}]
... redacted for brevity ...
In [5]: MeatballRole._meta._db.connection_kwargs
Out[5]:
{'database': '/root/.local/share/casper7/meatball.db',
'detect_types': 3,
'isolation_level': None}
In [6]: MeatballRole.select().where(MeatballRole.guild_id == 247116978509053953).run_sync()
Out[6]: [{'id': 1, 'guild_id': 247116978509053952, 'role_id': 814435003437023232}]
In [7]: MeatballRole.select().where(MeatballRole.guild_id == 740493455137964043).run_sync()
Out[7]: [{'id': 2, 'guild_id': 740493455137964032, 'role_id': 816711809964572672}]
in short, the results i get back via piccolo are different to those i get back from sqlite. i can query by the correct guild_id and get back the correct record, but the guild_id in the result is different.
i will try migrating to a varchar column as well.
update: changing my schema to use Text columns instead of Integer for the IDs did fix the issue.
@backwardspy It's a strange one. I need to write some tests to figure out what's going on.
@dantownsend yeah it took me a while to narrow it down to that example! please let me know if there's anything i can do to help.
@backwardspy At the moment I'm just trying to narrow down what the problem could be.
- SQLite max integer size - there is a limit, but it's higher than the numbers you're trying to save.
- Python 3 - integers are arbitrary precision, so an integer size limit is unlikely to be the problem.
- Is it a problem with an intermediate library like aiosqlite, or Piccolo itself?
Such a head scratcher! If you have any ideas let me know.
i've grabbed an offline copy of the database to play with later, i'll probably run the whole lot through a debugger and see if i can spot where it goes wrong.
@backwardspy Cool, thanks
i just spent some time following execution through piccolo and aiosqlite on a small repro project. it seems to get all the way down to the stdlib sqlite3 module before the IDs get broken.
i then tried the same operations just using the sqlite3 module directly but, perhaps unsurprisingly, i was unable to reproduce the issue.
i think next i'll see if i can repro with just aiosqlite, no piccolo, and try to narrow the issue down further.
so far so puzzling!
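for reference, a minimal aiosqlite-only check (no piccolo) along those lines might look something like this (just an illustrative sketch, using the guild_id value from earlier):

import asyncio
import aiosqlite

async def main():
    async with aiosqlite.connect(":memory:") as db:
        await db.execute("CREATE TABLE t (guild_id INTEGER NOT NULL)")
        await db.execute(
            "INSERT INTO t (guild_id) VALUES (?)", (247116978509053953,)
        )
        async with db.execute("SELECT guild_id FROM t") as cursor:
            row = await cursor.fetchone()
            print(row[0])  # expect 247116978509053953

asyncio.run(main())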
@backwardspy Thanks for investigating it 👍
Hopefully we will uncover this mystery!