
Strategies for large libraries

Open · shermp opened this issue 3 years ago • 5 comments

The user bigwoof on MobileRead has run into issues using KU with a large book library, which has brought to light that KU as released is not very memory efficient, and that even after improving memory usage, holding the entire calibre metadata set in memory can be problematic.

I've been trying to think of strategies to deal with this, and these are the ideas I've come up with so far:

  • Don't bother with calibre metadata. Just send Calibre whatever we have available in Nickel's DB. Simple to implement, and probably the most efficient. The downside is not keeping the metadata.calibre file in sync with the calibre kobo driver.
  • Store the metadata from calibre in some sort of file-based kv store, and maybe sync that store with metadata.calibre?
  • Similar to the above, but use an SQLite DB with proper columns to store the metadata (see the sketch after this list).
  • Find a way of indexing/accessing the JSON directly from the file.
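To make the SQLite option a little more concrete, here is a minimal sketch of what such a store might look like in Go. The schema, the column choices, and the mattn/go-sqlite3 driver are all assumptions for illustration, not a proposal for a final layout; the idea is just that hot fields get real columns while the full calibre record is kept as a blob so metadata.calibre can be regenerated:

```go
package meta

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // driver choice is an assumption
)

// openMetaDB opens (or creates) a book metadata store. Commonly
// queried fields get real columns; the full calibre record is kept
// as a raw JSON blob so metadata.calibre can be regenerated from the
// DB without loss.
func openMetaDB(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS meta (
		uuid          TEXT PRIMARY KEY,
		lpath         TEXT NOT NULL,
		title         TEXT,
		authors       TEXT,
		series        TEXT,
		series_index  REAL,
		last_modified TEXT,
		raw           TEXT
	)`)
	if err != nil {
		db.Close()
		return nil, err
	}
	return db, nil
}
```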

I'm really open to all ideas.

Paging @NiLuJe and @pgaskin and @pazos for ideas.

shermp · Jan 03 '21 03:01

> Don't bother with calibre metadata. Just send Calibre whatever we have available in Nickel's DB. Simple to implement, and probably the most efficient. The downside is not keeping the metadata.calibre file in sync with the calibre kobo driver.

I would go with that one. After all, nickel doesn't use metadata.calibre at all.

The plugin we use in KOReader discards most of the info that calibre streams for each new book. The rationale is: keep the bare minimum info needed to tell calibre on the next connection, plus a few fields useful for metadata lookups (title, authors, tags, series, series index). I think most of the junk you hold in memory is base64 thumbnails and user columns.

That way it is possible to keep track of thousands of books in memory without too much trouble. The metadata is dumped to a JSON file on each change, but that's only because it's needed for the "search on calibre metadata" function; if we didn't need that, I guess any binary format would be faster.
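As a rough Go illustration of that trimmed-down record (the struct, field, and function names are mine, and the JSON tags assume the usual metadata.calibre keys):

```go
package meta

import (
	"encoding/json"
	"os"
)

// BookMeta keeps only what's needed to answer calibre on the next
// connection plus the few fields used for metadata lookups; the
// base64 thumbnails and user columns are dropped on receipt.
type BookMeta struct {
	UUID        string   `json:"uuid"`
	Lpath       string   `json:"lpath"`
	Title       string   `json:"title"`
	Authors     []string `json:"authors"`
	Tags        []string `json:"tags,omitempty"`
	Series      string   `json:"series,omitempty"`
	SeriesIndex float64  `json:"series_index,omitempty"`
}

// dumpMeta persists the in-memory set on each change, mirroring the
// "write a JSON file whenever something changes" approach.
func dumpMeta(path string, books []BookMeta) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(books)
}
```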

pazos · Jan 03 '21 20:01

Yeah, if I do this, probably the only extra metadata I'd keep would be the Calibre UUID and maybe the last-modified date/time, as those are what's sent with the "book count" list.

shermp · Jan 03 '21 20:01

I'm not totally familiar with how the metadata code works or when the file is manipulated, but you could try using a streaming JSON parser and keeping an index into the JSON for read operations (maybe with a caching layer if you read the same thing often), then keeping an in-memory log of pending updates and writing them all at once. Alternatively, a database mirroring the Calibre metadata file and kept in sync with it (regenerating the Calibre metadata file when needed) would be another option, but I would probably avoid that unless absolutely necessary, due to the possible race conditions and bugs.
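A minimal sketch of that pending-update idea, assuming metadata.calibre is a single JSON array of records that each carry a uuid field (the function name and the map-of-raw-records shape are illustrative): updates accumulate in memory, then one streaming pass rewrites the file, substituting changed records as they go by, so the full set is never decoded at once.

```go
package meta

import (
	"encoding/json"
	"io"
	"os"
)

// applyPending rewrites the metadata file in one streaming pass,
// substituting updated records from the in-memory pending log.
// Records are matched by their calibre uuid; everything else stays
// opaque raw JSON.
func applyPending(src, dst string, pending map[string]json.RawMessage) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	dec := json.NewDecoder(in)
	if _, err := dec.Token(); err != nil { // consume the opening '['
		return err
	}
	if _, err := io.WriteString(out, "["); err != nil {
		return err
	}
	first := true
	for dec.More() {
		var raw json.RawMessage
		if err := dec.Decode(&raw); err != nil {
			return err
		}
		// Peek at the uuid only; the rest of the record stays raw.
		var key struct {
			UUID string `json:"uuid"`
		}
		if err := json.Unmarshal(raw, &key); err != nil {
			return err
		}
		if upd, ok := pending[key.UUID]; ok {
			raw = upd
		}
		if !first {
			io.WriteString(out, ",")
		}
		first = false
		if _, err := out.Write(raw); err != nil {
			return err
		}
	}
	_, err = io.WriteString(out, "]")
	return err
}
```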

pgaskin · Jan 03 '21 20:01

There are actually very few times when the full metadata is used. The JSON indexing idea is definitely something I've been thinking about. Do you know of a streaming decoder that can do this? I don't think it can be done with encoding/json.

shermp · Jan 03 '21 20:01

Doh, helps to RTFM.

Decoder.InputOffset looks to be what I need to build an index.
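For what it's worth, a sketch of how that index could be built with Decoder.InputOffset, again assuming metadata.calibre is a JSON array of records with a uuid key (buildIndex and readRecord are hypothetical names): each record is decoded as a json.RawMessage, and since InputOffset reports the offset at the end of the value just decoded, subtracting the raw length gives the record's start, which a later seek + decode can use.

```go
package meta

import (
	"encoding/json"
	"io"
	"os"
)

// buildIndex records the byte offset of every record in the metadata
// file, keyed by uuid, so single records can later be read with a
// seek + decode instead of loading the whole file.
func buildIndex(path string) (map[string]int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	dec := json.NewDecoder(f)
	if _, err := dec.Token(); err != nil { // consume the opening '['
		return nil, err
	}
	index := make(map[string]int64)
	for dec.More() {
		var raw json.RawMessage
		if err := dec.Decode(&raw); err != nil {
			return nil, err
		}
		// InputOffset now points at the end of the record just read;
		// the RawMessage holds its exact bytes, so subtracting its
		// length gives the record's start offset.
		start := dec.InputOffset() - int64(len(raw))
		var key struct {
			UUID string `json:"uuid"`
		}
		if err := json.Unmarshal(raw, &key); err != nil {
			return nil, err
		}
		index[key.UUID] = start
	}
	return index, nil
}

// readRecord fetches a single record given its indexed offset.
func readRecord(f *os.File, off int64, v interface{}) error {
	if _, err := f.Seek(off, io.SeekStart); err != nil {
		return err
	}
	return json.NewDecoder(f).Decode(v)
}
```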

shermp · Jan 03 '21 21:01