
Cache size to avoid delays on big db files

Open twall opened this issue 4 years ago • 3 comments

I use mostly read-only dbs, so the time spent computing their size is wasted. Even in the case of writable dbs, caching the last known size and timestamp, and comparing the timestamp against the db file's mtime, can be useful.

Something like this:

import os
import time

@property
def _cached_size(self):
    # Assumes a _select_one helper that runs a query and returns the single
    # decoded value, or None if the row (or the whole table) doesn't exist.
    tablename = "meta"
    GET_ITEM = f'SELECT value FROM "{tablename}" WHERE key = ?'
    size = self._select_one(GET_ITEM, ("size",))
    ts = self._select_one(GET_ITEM, ("ts",))
    if size is None or ts is None:
        return None  # nothing cached yet
    fts = os.path.getmtime(self.filename)
    # Trust the cached size only if it was recorded after the last file modification.
    return size if ts > fts else None

@_cached_size.setter
def _cached_size(self, size):
    tablename = "meta"
    MAKE_TABLE = f'CREATE TABLE IF NOT EXISTS "{tablename}" (key TEXT PRIMARY KEY, value BLOB)'
    self.conn.execute(MAKE_TABLE)
    ADD_ITEM = f'REPLACE INTO "{tablename}" (key, value) VALUES (?, ?)'
    self.conn.execute(ADD_ITEM, ("size", self.encode(size)))
    # The file's mtime is updated _after_ the value is written, so record a
    # timestamp with some slack and retry until it is ahead of the file's mtime.
    delay = 0.5
    while True:
        ts = time.time() + delay
        self.conn.execute(ADD_ITEM, ("ts", self.encode(ts)))
        self.conn.commit()
        if os.path.getmtime(self.filename) < ts:
            break
        delay += 0.5
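
For illustration, a hypothetical __len__ built on top of this cache (reusing the assumed _select_one helper and a standard COUNT query) might look like:

def __len__(self):
    # Hypothetical sketch: fall back to the slow COUNT(*) only when the
    # cached size is missing or stale, then refresh the cache.
    size = self._cached_size
    if size is None:
        GET_LEN = f'SELECT COUNT(*) FROM "{self.tablename}"'
        size = self._select_one(GET_LEN, ())  # the slow path on big files
        self._cached_size = size
    return size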

twall avatar Dec 02 '20 16:12 twall

Interesting. Can you expand on the motivation? In "I use mostly read-only dbs, so the time spent sizing them is wasted", what exactly is wasted, how much of it, how does it affect you, and how do you expect the fix will affect others?

Adding an extra meta table to help with caching is possible, but it's additional complexity, so it needs to be well motivated (and documented).

piskvorky avatar Dec 02 '20 17:12 piskvorky

I'm using this as a cache for meta-information from a much larger ES database (the difference in access time is orders of magnitude). Once the ES db is established, it's effectively read-only, but I need to generate a number of smaller, derived databases (in redis) using different subsets of data. These smaller dbs are generated as-needed over a longer period of time.

When I go to use the cache, I need to ensure it's in sync with the ES database/index (thus checking the size), but I also want to have the full size to pass to tqdm for progress meters. Given the size of the sqlite db, getting the length can take 4-5 minutes, which is painful when I'm using more than one of these representing more than one ES index.
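
(For context, the usage pattern is roughly the sketch below; SqliteDict and tqdm are the real libraries, the file path is made up. The len(db) call is what triggers the expensive full count:)

from sqlitedict import SqliteDict
from tqdm import tqdm

db = SqliteDict("/path/to/es_meta_cache.sqlite", flag="r")  # hypothetical path, read-only
# total=len(db) is what takes minutes on a big file without a cached size.
for key, value in tqdm(db.items(), total=len(db)):
    ...  # build a derived redis db from each record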


twall avatar Dec 02 '20 17:12 twall

ES => subset of data into sqlitedict => process data into a number of different redis dbs. ES and sqlitedict are modified quite rarely in comparison to the redis dbs.

For the moment, I'm just using a subclass, which is fine (it also implements a non-threaded model to improve error handling and general performance). I'm also using msgpack for serialization, which is much, much faster (and smaller) than pickle.
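
(For reference, sqlitedict's encode/decode constructor arguments make that serialization swap straightforward; a minimal sketch, assuming the msgpack package and a made-up path:)

import sqlite3
import msgpack
from sqlitedict import SqliteDict

def msgpack_encode(obj):
    # Wrap in sqlite3.Binary so sqlite stores the bytes as a BLOB.
    return sqlite3.Binary(msgpack.packb(obj, use_bin_type=True))

def msgpack_decode(blob):
    return msgpack.unpackb(bytes(blob), raw=False)

db = SqliteDict("/path/to/cache.sqlite", encode=msgpack_encode, decode=msgpack_decode)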


twall avatar Dec 02 '20 17:12 twall