Missing Indexes in SQLite Database Causing Performance Bottlenecks
I've noticed that certain queries on the SQLite database used by beets, particularly `album.items()`, are quite slow on my machine.
After some investigation, I found that the database does not currently define any indexes. Specifically, the `items` table lacks an index on the `album_id` column, which seems to be the root cause of the slowdown when querying items by album.
I manually created an index on `album_id`, and the performance improvement was significant: query times dropped by approximately a factor of 4 in my benchmarks.
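For reference, adding the index is a one-statement change; here's a minimal sketch using the standard `sqlite3` module (the index name is my own choice, and the path is an example):

```python
import sqlite3


def add_album_id_index(db_path: str) -> None:
    """Create a single-column index on items.album_id if it doesn't exist."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE INDEX IF NOT EXISTS items_album_id_idx ON items (album_id)"
        )


# Usage (path is an example): add_album_id_index("library.db")
```

`CREATE INDEX IF NOT EXISTS` makes this safe to run repeatedly.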
Suggestion
Adding indexes for frequently queried fields (like `album_id`) could provide a substantial performance boost for many users. However, I'm not very familiar with the dbcore layer, and I'm unsure how or where to properly define these indexes within the beets codebase. I don't think we have a mechanism in place for applying any kind of database migrations.
Temporary Workaround
For anyone else running into similar performance issues, or if you just want to test this, here's a small script that manually adds the index and runs a benchmark on your library.
```python
import sqlite3
import time
from pathlib import Path

from beets.library import Library

# Configure your db path
db_path = Path("./library.db")
idx_name = "items_album_id_idx"


def run_sql_exc(path, query):
    """Execute a single SQL statement against the database file."""
    conn = sqlite3.connect(path)
    cursor = conn.cursor()
    cursor.execute(query)
    conn.commit()
    conn.close()


def run_benchmark(lib, indexed=True):
    """Fetch every album's items and time the whole run."""
    time_start = time.perf_counter()
    albums = lib.albums()
    items = []
    for album in albums:
        items.extend(album.items())
    duration = time.perf_counter() - time_start
    n_albums = len(albums)
    n_items = len(items)
    print(
        f"Benchmark with{'' if indexed else 'out'} index took {duration:.2f} seconds"
    )
    print(f"\tnAlbums: {n_albums} (avgtime {duration * 1000 / n_albums:.2f} ms per matchquery)")
    print(f"\tnItems: {n_items} (avgtime {duration * 1000 / n_items:.2f} ms per item)")


if __name__ == "__main__":
    run_sql_exc(db_path, 'DROP INDEX IF EXISTS "{}"'.format(idx_name))
    print("Benchmark without index...")
    run_benchmark(Library(db_path.name, str(db_path.resolve().parent)), indexed=False)

    print("Creating index on items.album_id...")
    run_sql_exc(
        db_path, 'CREATE INDEX IF NOT EXISTS "{}" ON items (album_id)'.format(idx_name)
    )

    print("Benchmark with index...")
    run_benchmark(Library(db_path.name, str(db_path.resolve().parent)), indexed=True)
    print("Done.")
```
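To double-check that SQLite actually uses the new index rather than scanning the whole table, you can inspect the plan for the same lookup the benchmark exercises with `EXPLAIN QUERY PLAN` (table and column names as above; the exact wording of the plan varies between SQLite versions):

```python
import sqlite3


def query_plan(db_path: str) -> str:
    """Return SQLite's plan for looking up items by album_id."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM items WHERE album_id = ?", (1,)
    ).fetchall()
    conn.close()
    # The last column of each row is a human-readable plan step.
    return " ".join(row[-1] for row in rows)


# Without the index this typically reports a SCAN of items;
# with it, a SEARCH using items_album_id_idx.
```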
For me, adding this single index on `items` yields the following improvement:
```
Benchmark without index...
Benchmark without index took 2.27 seconds
    nAlbums: 521 (avgtime 4.36 ms per matchquery)
    nItems: 3977 (avgtime 0.57 ms per item)
Creating index on items.album_id...
Benchmark with index...
Benchmark with index took 0.66 seconds
    nAlbums: 521 (avgtime 1.27 ms per matchquery)
    nItems: 3977 (avgtime 0.17 ms per item)
Done.
```
With a larger database, the difference spans orders of magnitude:
```
Benchmark without index...
Benchmark without index took 1787.29 seconds
    nAlbums: 1911 (avgtime 935.27 ms per matchquery)
    nItems: 19586 (avgtime 91.25 ms per item)
Creating index on items.album_id...
Benchmark with index...
Benchmark with index took 13.31 seconds
    nAlbums: 1911 (avgtime 6.96 ms per matchquery)
    nItems: 19586 (avgtime 0.68 ms per item)
Done.
```
I think we can even add more indexes to optimize the joins between the `[item|album]_attributes` and `[album|item]` tables. We should be able to get even more improvement there.
Keep in mind, though, that this benchmark is somewhat arbitrary and might not translate directly into real-world performance gains.
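If I read the schema right, the flexible-attribute tables join back to their parent rows through an `entity_id` column, so something along these lines should help there too (treat the column and index names here as assumptions to verify against the actual schema):

```python
import sqlite3

# Hypothetical additional indexes for the flexible-attribute joins;
# verify that the join column is really entity_id before relying on this.
EXTRA_INDEXES = [
    "CREATE INDEX IF NOT EXISTS item_attributes_entity_id_idx "
    "ON item_attributes (entity_id)",
    "CREATE INDEX IF NOT EXISTS album_attributes_entity_id_idx "
    "ON album_attributes (entity_id)",
]


def add_attribute_indexes(db_path: str) -> None:
    """Create the attribute-table indexes if they don't exist."""
    with sqlite3.connect(db_path) as conn:
        for stmt in EXTRA_INDEXES:
            conn.execute(stmt)
```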
Oh wow, this is amazing!
@semohr
This is crazy: with the index it runs about 809 times faster, finishing in 0.12% of the original time (a 99.88% reduction):
- Without index: 14 hours, 38 minutes, 35 seconds
- With index: 1 minute, 5 seconds
```
Benchmark without index...
Benchmark without index took 52714.58 seconds
    nAlbums: 54977 (avgtime 958.85 ms per matchquery)
    nItems: 669049 (avgtime 78.79 ms per item)
Creating index on items.album_id...
Benchmark with index...
Benchmark with index took 65.15 seconds
    nAlbums: 54977 (avgtime 1.19 ms per matchquery)
    nItems: 669049 (avgtime 0.10 ms per item)
```
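Checking the speedup claim against those wall-clock numbers:

```python
# Total runtimes reported by the benchmark above, in seconds.
without_index_s = 52714.58
with_index_s = 65.15

speedup = without_index_s / with_index_s
print(f"{speedup:.0f}x faster")  # prints: 809x faster
```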
How would we go about adding this change to the beets code?
We would need some kind of migration system for the database schema change. In theory, adding indexes is only a simple addition to the schema, but a full-fledged system that supports arbitrary changes would be preferable.
Sadly, we don't have anything in place for this yet. I'm not even sure how we would add that in the db layer.
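For what it's worth, SQLite's built-in `PRAGMA user_version` is enough for a minimal forward-only migration mechanism. This is just a sketch of one possible shape, not how beets' dbcore works today:

```python
import sqlite3

# Ordered list of forward-only migrations; the list position defines
# the schema version each statement brings the database up to.
MIGRATIONS = [
    "CREATE INDEX IF NOT EXISTS items_album_id_idx ON items (album_id)",
    # Future migrations get appended here.
]


def migrate(db_path: str) -> int:
    """Apply pending migrations, tracking progress via PRAGMA user_version."""
    conn = sqlite3.connect(db_path)
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for next_version, stmt in enumerate(MIGRATIONS[version:], start=version + 1):
        conn.execute(stmt)
        # PRAGMA values cannot be bound as parameters; next_version is an int.
        conn.execute(f"PRAGMA user_version = {next_version}")
        conn.commit()
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    conn.close()
    return version
```

Running `migrate()` twice is a no-op the second time, since the stored version already matches the list length.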
I've about 34,000 tracks, and without the indexes I can see, e.g. while using `beets mv`, processing at ~5 min/file. With the index it is ~300 times faster, but it still takes far more than 1 week to move all files.
So currently my library is more or less broken, as it will basically never finish in an acceptable time. The drive is an NVMe SSD.
Any suggestions for how I could sensibly split the library to be able to operate with it again?
> I've about 34,000 tracks, and without the indexes I can see, e.g. while using `beets mv`, processing at ~5 min/file. With the index it is ~300 times faster, but it still takes far more than 1 week to move all files.
If it is 300 times faster, it should finish in less than half a day (5 min / 300 × 34,000 ≈ 0.39 days), which seems acceptable to me. If you are on Linux, file moves should be more or less instant within the same file system. With a bit of napkin math and estimation, I would expect 34,000 tracks (at ~50 MB each) to be done in an hour tops (at 500 MB/s). It needs a bit more investigation to find out what exactly the bottleneck is here.
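The napkin math above, spelled out:

```python
# Figures from the comment above: 5 min/file without the index,
# a ~300x speedup with it, across 34,000 tracks.
minutes_per_file = 5 / 300            # per-file time with the index
total_minutes = minutes_per_file * 34_000
total_days = total_minutes / (60 * 24)
print(f"{total_days:.2f} days")       # prints: 0.39 days
```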
What beets command are you running?
You might be able to triage the bottleneck by including timestamps in your debug logs. Here's a git patch for that:
```diff
diff --git a/beets/ui/__init__.py b/beets/ui/__init__.py
index 12eb6d005..3bfae01f3 100644
--- a/beets/ui/__init__.py
+++ b/beets/ui/__init__.py
@@ -54,10 +54,18 @@ if sys.platform == "win32":
 else:
     colorama.init()
 
+from logging import Formatter
 log = logging.getLogger("beets")
 if not log.handlers:
-    log.addHandler(logging.StreamHandler())
+    sh = logging.StreamHandler()
+    fmt = Formatter(
+        "%(asctime)s [%(filename)s:%(lineno)d] %(levelname)s: %(message)s",
+        datefmt="%Y-%m-%d %H:%M:%S",
+    )
+    sh.setFormatter(fmt)
+    log.addHandler(sh)
+
 log.propagate = False  # Don't propagate to root handler.
```
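You can also try the same format string standalone before patching beets, to see what the timestamped lines will look like (the logger name here is just for the demo):

```python
import logging

# Standalone demo of the timestamped format used in the patch above.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(filename)s:%(lineno)d] %(levelname)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
))

log = logging.getLogger("beets-demo")
log.addHandler(handler)
log.setLevel(logging.DEBUG)

# Emits something like: 2024-01-01 12:00:00 [demo.py:14] DEBUG: moving item
log.debug("moving item")
```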