Doesn't serialize custom types in list/dict
I only recently realized that I could put Dict/List items as values on a NamedTuple cached by cachew; previously I was doing some weird stuff to work around it.
So, in the process of switching more things over to cachew, I ran into an issue where one of the items in the List being serialized by cachew couldn't be converted to JSON:
from datetime import datetime
from typing import List, NamedTuple, Iterator
from cachew import cachew

class action(NamedTuple):
    dt: datetime
    val: int

class wrapped(NamedTuple):
    actions: List[action]  # the list causes this field to be serialized with json.dumps
    val: str

@cachew
def values() -> Iterator[wrapped]:
    yield wrapped(actions=[action(datetime.now(), 5)], val="something")

list(values())
list(values())
generates quite the error:
cachew: error while setting up cache, falling back to non-cached version
(builtins.TypeError) Object of type datetime is not JSON serializable
[SQL: INSERT INTO "table" (actions, val) VALUES (?, ?)]
Traceback (most recent call last):
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1204, in _execute_context
context = constructor(dialect, self, conn, *args)
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 839, in _init_compiled
param.append(processors[key](compiled_params[key]))
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/sql/type_api.py", line 1232, in process
return process_param(value, dialect)
File "/home/sean/.local/lib/python3.9/site-packages/cachew/__init__.py", line 163, in process_bind_param
return json.dumps(value)
File "/usr/lib/python3.9/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python3.9/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python3.9/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/lib/python3.9/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/sean/.local/lib/python3.9/site-packages/cachew/__init__.py", line 1006, in cachew_wrapper
flush()
File "/home/sean/.local/lib/python3.9/site-packages/cachew/__init__.py", line 993, in flush
conn.execute(values_table.insert().values(chunk))
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
return meth(self, multiparams, params)
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
ret = self._execute_context(
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1206, in _execute_context
self._handle_dbapi_exception(
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
util.raise_(
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
raise exception
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1204, in _execute_context
context = constructor(dialect, self, conn, *args)
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 839, in _init_compiled
param.append(processors[key](compiled_params[key]))
File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/sql/type_api.py", line 1232, in process
return process_param(value, dialect)
File "/home/sean/.local/lib/python3.9/site-packages/cachew/__init__.py", line 163, in process_bind_param
return json.dumps(value)
File "/usr/lib/python3.9/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python3.9/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python3.9/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/lib/python3.9/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type datetime is not JSON serializable
[SQL: INSERT INTO "table" (actions, val) VALUES (?, ?)]
(the second call to values() prints the same "error while setting up cache" warning with an identical traceback)
I'm not expecting this to be fixed, as I understand the problems with serializing/deserializing JSON; I was just stuck here for a while because the readme says lists and datetime should work, so I wasn't sure why there was an error (it's being thrown here).
Makes sense that it's using JSON, as otherwise you would have to maintain an intersection table and map ids from another table onto list/dict items, which sounds like a pain.
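For what it's worth, a workaround that seems to avoid the error (just a sketch of my own, converting the nested datetime to an ISO string myself rather than anything cachew provides):

from datetime import datetime
from typing import List, NamedTuple, Iterator
from cachew import cachew

class action(NamedTuple):
    dt: str  # ISO-formatted string instead of datetime, so json.dumps can handle the nested value
    val: int

class wrapped(NamedTuple):
    actions: List[action]
    val: str

@cachew
def values() -> Iterator[wrapped]:
    # the timestamp round-trips through JSON as a string; callers can datetime.fromisoformat() it back
    yield wrapped(actions=[action(datetime.now().isoformat(), 5)], val="something")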
Anyway, as far as changes go, a better error message could be raised and/or the documentation could be updated to better reflect this pitfall, so that no one else runs into the cryptic error in the future.
Yep, definitely agree about the error message.
I guess in principle it would be nice to support this as well -- after all, the types that can be stored natively in sqlite and as json are pretty much the same anyway.
In fact in hindsight I'm not sure if I should have really bothered with flattening onto columns at all.
Pros are:
- It gives some nice properties like ability to use sqlite queries
- can potentially be more compact on disk (no need to store key names)
- might be quicker to deserialize? Although I'd expect these benefits to be pretty marginal
Cons are:
- you can't flatten everything anyway (like Lists/Jsons)
- it makes it a bit more complicated/inconsistent by needing to handle both sqlite columns and json-ish things
An alternative is to simply convert everything to a json-like thing and just store it in a single column. In addition:
- still possible to do queries against these json objects (although I didn't use this much so far https://www.sqlite.org/json1.html#jex ) -- see the sketch below
- still might be possible to optionally 'cache' specific columns as sqlite native on demand
- might be possible to compress the database somehow as well (don't know much about it, and haven't had such need anyway)
- it might make it fairly trivial to support alternative backends (like postgres/redis/etc)
Still not sure about this, what do you think?
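To illustrate the querying point from the list above -- a minimal sketch (made-up table/column names, not cachew's actual schema) of querying inside a single-column JSON blob with sqlite's json1 functions:

import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cache (value TEXT)')  # hypothetical single-column layout
conn.execute('INSERT INTO cache (value) VALUES (?)',
             (json.dumps({'dt': '2020-01-01T00:00:00', 'val': 5}),))
# json_extract lets us query inside the blob (needs an sqlite build with json1, which is standard these days)
rows = conn.execute("SELECT json_extract(value, '$.dt') FROM cache "
                    "WHERE json_extract(value, '$.val') > 3")
print(rows.fetchall())  # [('2020-01-01T00:00:00',)]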
An alternative is to simply convert everything to a json-like thing and just store it in the single column
Yeah, the only thing I'm unsure of here is whether recursive NT/classes would be converted back? I've had a few cases where I have very basic namedtuples in a list, and it converts them back into a list of lists instead of a list of namedtuples (the namedtuple is converted to a tuple, and since JSON doesn't have the concept of a tuple I guess it's just stored as a list) - here are those definitions
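A minimal illustration of that round-trip (with a made-up Point type, not the actual definitions):

import json
from typing import NamedTuple

class Point(NamedTuple):  # stand-in for the basic namedtuples mentioned above
    x: int
    y: int

blob = json.dumps([Point(1, 2), Point(3, 4)])  # tuples (and so namedtuples) are encoded as JSON arrays
print(json.loads(blob))  # [[1, 2], [3, 4]] -- plain lists, the namedtuple type is gone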
Seems like there isn't a ton of benefit to using a sqlite database then, though? Since it's just storing JSON blobs with no type information, it's just the sqlalchemy machinery you're using to convert to NTs and back. Though, I don't know whether mapping onto the columns is slower than doing the conversion from a dict back to the namedtuple, it will just have to be tested...
I do this manually and with a lot more error checking here by inspecting the signature for the namedtuple, but I assume the sqlalchemy bindings are better for this
the only thing I'm unsure of here is if recursive NT/classes would be converted back?
As long as we can rely on schemas (i.e. fully annotated types) passed both on encoding and decoding, this should work? It's not different from laying out onto sqlite columns, e.g. datetime types are stored as strings, but are reconstructed back into datetime because cachew has access to the type info: https://github.com/karlicoss/cachew/blob/master/src/cachew/__init__.py#L383-L385
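Roughly what I mean -- a toy sketch (not cachew's actual code) of how the type annotations are enough to rebuild nested NamedTuples from the JSON-ish lists:

from datetime import datetime
from typing import List, NamedTuple, get_args, get_origin, get_type_hints

class Action(NamedTuple):
    dt: datetime
    val: int

class Wrapped(NamedTuple):
    actions: List[Action]
    val: str

def restore(tp, value):
    # rebuild a value of type tp from its JSON-ish representation
    if get_origin(tp) is list:
        (item_tp,) = get_args(tp)
        return [restore(item_tp, v) for v in value]
    if isinstance(tp, type) and issubclass(tp, tuple) and hasattr(tp, '_fields'):  # NamedTuple
        hints = get_type_hints(tp)
        return tp(*(restore(hints[f], v) for f, v in zip(tp._fields, value)))
    if tp is datetime:
        return datetime.fromisoformat(value)
    return value

# JSON turned the nested namedtuple into a plain list, but the schema lets us rebuild it
raw = [[['2021-01-01T00:00:00', 5]], 'something']
print(restore(Wrapped, raw))
# Wrapped(actions=[Action(dt=datetime.datetime(2021, 1, 1, 0, 0), val=5)], val='something')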
mapping onto the columns is slower than doing the conversion from a dict back to the namedtuple, will just have to be tested...
Yeah, not sure either
Seems like there isnt a ton of benefit to using a sqlite database then though?
Yeah, good question. And it's not in the FAQ, so I wonder if I seriously considered it :) Let's think...
Cons of having a JSONL file over sqlite with a single-column json blob:
- might be nice to eventually support an in-memory cache by leveraging sqlalchemy, in that sense it makes more sense to keep it sqlite
- hash would need to be kept somewhere. Could just put it in the filename? but then it's harder to recycle old caches... although maybe still possible to do atomically, e.g. take a snapshot of everything with mtimes, and then once the writer finishes, compare them and delete the ones that weren't changed?
Pros:
- atomic guarantees are easier to implement in a sense (compared to dealing with sqlite): write to a temporary file, and then rename; the rename call is atomic on linux (at least on reasonable file systems?). And if the file is replaced while someone is reading it, it's also OK, the reader's file descriptor still lingers, so in this sense it's all good (see the sketch after this list)
- implementing retention, i.e. keeping the last N hash keys instead of just one, might be even easier -- just a matter of removing/keeping more files
- reading might be faster from sqlite, although considering we're storing a variable-length column it probably can't be that much faster? Although sqlite probably keeps an index, so it doesn't need to scan for newlines like reading from a file would. But probably need to measure
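To illustrate the atomicity point above (a sketch with made-up names, not cachew code):

import json
import os
import tempfile

def atomic_write_jsonl(path, objects):
    # write to a temp file in the same directory, then rename it into place;
    # os.replace is atomic on POSIX, so readers see either the old file or the new one
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            for obj in objects:
                f.write(json.dumps(obj) + '\n')
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write_jsonl('cache.jsonl', [{'dt': '2021-01-01T00:00:00', 'val': 5}])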
Hmm.
Somewhat related: https://github.com/jsonpickle/jsonpickle. Although it's more like a regular pickle, not sure.
In [6]:
class ObjWithDate(object):
    def __init__(self):
        ts = datetime.datetime.now()
        self.data = dict(a='a', ts=ts)
        self.data_ref = dict(b='b', ts=ts)
In [7]: jsonpickle.encode(ObjWithDate())
Out[7]: '{"py/object": "__main__.ObjWithDate", "data": {"a": "a", "ts": {"py/object": "datetime.datetime", "__reduce__": [{"py/type": "datetime.datetime"}, ["B+UEEAc6MQuycA=="]]}}, "data_ref": {"b": "b", "ts": {"py/id": 2}}}'
Wanted to simplify cachew a bit by potentially switching to third party serialization -- tried marshmallow/pydantic/dataclasses_json/cattrs (might post some notes later).
Basically, for one reason or another they don't work properly for this case (usually issues with Union support, or having to inherit your classes from some sort of base model).
So I decided to optimize the one we have instead, and as an experiment tried just serializing as json into a single column (instead of flattening it over multiple columns like we do currently).
And it actually seems quite a bit faster (and the code is much easier to read!), not sure why I settled on the flattening back then, perhaps wasn't benchmarking properly.
test_nosql.py::test_many_from_cachew
building 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.53s
serializing 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 1.00s
json dump 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.51s
sqlite dump 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 1.16s
sqlite load 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.31s
jsonl dump 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.15s
jsonl load 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.21s
json load 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.52s
deserializing 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 1.29s
for comparison, current cachew stats for this test
- reading from db 1.21s -- corresponds to "sqlite load", note that sqlalchemy in cachew introduces some slowdown too which we should address
- deserializing 3.29s -- this is roughly the sum of "json load + deserializing" -- so about a 50% speedup
We won't necessarily lose the individual column mappings either, it's possible to define virtual columns for accessing values inside the JSON, as described here https://antonz.org/json-virtual-columns/
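Something like this sketch (made-up table/column names; generated columns need sqlite >= 3.31):

import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cache (value TEXT)')  # single JSON blob per row
conn.execute('INSERT INTO cache (value) VALUES (?)',
             (json.dumps({'dt': '2021-01-01T00:00:00', 'val': 5}),))
# a virtual column computed from the blob, which can then be indexed and queried like a real column
conn.execute("ALTER TABLE cache ADD COLUMN val INTEGER AS (json_extract(value, '$.val')) VIRTUAL")
conn.execute('CREATE INDEX cache_val ON cache (val)')
print(conn.execute('SELECT val FROM cache WHERE val > 3').fetchall())  # [(5,)]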
Using a text file with json (jsonl) instead of sqlite also gives quite a big speedup (especially for writing) -- so I might think about switching to it later as well (or adding as an extra 'backend')
Added benefit would be that cache could be easily compressed as well, so might be easier to share between computers.
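E.g., just a sketch -- compressing a jsonl cache is about as simple as swapping open() for gzip.open():

import gzip
import json

# write the cache compressed -- same line-per-object format, just gzipped
with gzip.open('cache.jsonl.gz', 'wt') as f:
    for obj in [{'val': 1}, {'val': 2}]:
        f.write(json.dumps(obj) + '\n')

# reading it back is one json.loads per line
with gzip.open('cache.jsonl.gz', 'rt') as f:
    print([json.loads(line) for line in f])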
For now it's just an isolated experiment here https://github.com/karlicoss/cachew/tree/nosql, but I'll transplant it to cachew soon
Wrote up a comparison here, might be interesting for other projects as well https://github.com/karlicoss/cachew/blob/master/doc/serialization.org
Actually cattrs seems very close to my new implementation, a tiny bit faster for serializing and a tiny bit slower for deserializing -- so I'll need to do a bit more testing.
Should be fixed in the latest pypi version!