Quality Check - Performance
I have a large library with c. 30,000 books in it. Quality Check can take quite a long time to run through these (e.g. check author sort).
I am wondering whether there is scope for some performance improvements. Here is the code for check_author_sort_valid:
def check_author_sort_valid(self):
    def evaluate_book(book_id, db):
        current_author_sort = db.author_sort(book_id, index_is_id=True)
        authors = db.authors(book_id, index_is_id=True)
        if not authors:
            return True
        authors = [a.strip().replace('|', ',') for a in authors.split(',')]
        if current_author_sort != db.author_sort_from_authors(authors):
            return True
        return False
    self.check_all_files(evaluate_book,
                         no_match_msg=_('All searched books have a valid Author Sort'),
                         marked_text='invalid_author_sort',
                         status_msg_type=_('books for invalid author sort'))
In essence this loops over each book in the library and makes 3 db calls (presumably against an in-memory cached copy of the database).
Obviously the code in a plugin is limited by the Calibre API, but I am wondering whether the following optimisations might be possible (using this example, but presumably possible much more generally in this plugin):
- Making fewer DB calls for each book (in evaluate_book). Here we make 3 calls to db: to get the book's authors, the book's author_sort, and the default author_sort for each of the authors. Would it be possible to do this in fewer calls, e.g. by getting the book's metadata object and then reading the authors and author_sort from that, and if so would it be any quicker (I suspect the gain would be marginal)?
- Starting the entire check with a single SQL query joining the books table and the authors table to get the book_ids, i.e. something like:

  SELECT book_id FROM books, book_authors, authors
  WHERE books.book_id = book_authors.book_id
    AND book_authors.author_id = authors.author_id
    AND book_authors.author_sort <> authors.author_sort

  and then just looping through the results to mark these books for the UI. I suspect that if this were possible, it would be much, much quicker.
If the Calibre db API doesn't have this sort of functionality, could you persuade Kovid to develop this sort of raw interface or to accept a PR for it?
That SQL won't do what you want, failing if a book has more than one author. Quality check compares the existing author_sort with an author_sort generated from the list of authors. I don't see how to write a reliable SQL statement that does that. I'm also quite sure that Kovid wouldn't be interested in adding such a specific API, but I might be wrong.
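To make the multi-author failure mode concrete, here is a minimal plain-Python sketch. The sort_name helper is a simplified stand-in for calibre's configurable author-sort rules, not the real API; the point is only that the stored author_sort is one combined string per book, while a SQL join produces one row per author, so a per-row inequality flags perfectly valid books:

```python
def sort_name(author):
    # Simplified "Last, First" sort -- a stand-in, calibre's real rules are configurable.
    parts = author.rsplit(' ', 1)
    return f'{parts[1]}, {parts[0]}' if len(parts) == 2 else author

def combined_sort(authors):
    # The book-level author_sort joins the per-author sorts with ' & '.
    return ' & '.join(sort_name(a) for a in authors)

authors = ['Terry Pratchett', 'Neil Gaiman']
stored = 'Pratchett, Terry & Gaiman, Neil'   # the book's stored author_sort

# The book-level comparison correctly finds this book valid:
assert stored == combined_sort(authors)

# But a per-row SQL comparison tests stored against each author's own sort,
# one join row at a time, and neither row matches:
per_row_matches = [stored == sort_name(a) for a in authors]
print(per_row_matches)  # [False, False] -- the query would flag a valid book
```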
Here is an example of how to do it with current APIs. It is a calibre python template that builds a cache then tests the selected books against the cache. The plugin would of course generate the cache once then loop through the books in the database.
python:
def evaluate(book, context):
    # Cache the data
    # Use the "new" calibre db api
    db = context.db.new_api
    # Get all the book ids in the database
    all_book_ids = db.all_book_ids()
    # Get the computed author sort string for all books
    computed_sort_string_for_books = db.author_sort_strings_for_books(all_book_ids)
    # Get the actual author sort string for all books
    author_sort_string_for_books = db.all_field_for('author_sort', all_book_ids)
    # Check a book
    # print(' & '.join(computed_sort_string_for_books[book.id]), '***', author_sort_string_for_books[book.id])
    return str(' & '.join(computed_sort_string_for_books[book.id]) != author_sort_string_for_books[book.id])
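For the plugin side, the idea would be: fill the two caches once, then make evaluate_book a pair of dictionary lookups. A sketch of that pattern, using hand-built dicts as stand-ins for the results of db.author_sort_strings_for_books() and db.all_field_for('author_sort', ...) so it runs outside calibre (make_evaluate_book is a hypothetical helper name):

```python
# book_id -> tuple of computed per-author sorts (the shape the new API returns)
computed_sorts = {
    1: ('Pratchett, Terry', 'Gaiman, Neil'),
    2: ('Austen, Jane',),
}
# book_id -> the author_sort actually stored on the book record
stored_sorts = {
    1: 'Pratchett, Terry & Gaiman, Neil',
    2: 'Austen, J.',   # stale / hand-edited value, should be flagged
}

def make_evaluate_book(computed, stored):
    # Close over the caches so the returned function keeps the plugin's
    # evaluate_book(book_id, db) signature; db is unused once cached.
    def evaluate_book(book_id, db=None):
        return ' & '.join(computed[book_id]) != stored[book_id]
    return evaluate_book

evaluate_book = make_evaluate_book(computed_sorts, stored_sorts)
print([bid for bid in computed_sorts if evaluate_book(bid)])  # [2]
```

Each per-book check is then O(1) against the caches instead of three API calls, which is where the speed-up in this approach should come from.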
It would be interesting to see if this implementation is significantly faster. Filling the cache might be as expensive as the existing code. I don't think so, but I don't know so.
@cbhaley thanks for the suggested code on this. I (finally) found some time to give this a try and it is significantly more performant. On a 50K book library, the old code takes 3 mins, the new version takes 15 seconds. So definitely worth the effort.
Will be fixed in 1.14.5