Add support for creating indices on arbitrary database tables
Creating indices turned out to be relatively straightforward! While we can’t remove them yet, that doesn’t seem necessary for now. Interestingly, much of the infrastructure for database additions was already in place.
- This PR introduces a new type, `Index`, which can be used to create an index on any defined table. :tada:
- The `items` table now automatically registers an index on `album_id`.
Closes #5809
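For illustration, here is a minimal sketch of what the new `Index` type and its registration could look like. The attribute names (`name`, `columns`) and the `_indices` hook are taken from the diffs quoted later in this thread; the exact constructor shape and the `create_sql` helper are assumptions, not the PR's actual code.

from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    """Declarative description of a database index (sketch)."""

    name: str
    columns: tuple[str, ...]

    def create_sql(self, table: str) -> str:
        # Idempotent creation; SQLite skips the work if the index exists.
        return (
            f"CREATE INDEX IF NOT EXISTS {self.name} "
            f"ON {table} ({', '.join(self.columns)});"
        )


# Hypothetical registration on the Item model; the index name matches
# the idx_item_album_id seen in the debug output further down.
class Item(LibModel):  # LibModel as defined in beets.library
    _table = "items"
    _indices = [Index("idx_item_album_id", ("album_id",))]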
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 67.96%. Comparing base (5393127) to head (f7ddcde).
:white_check_mark: All tests successful. No failed tests found.
Additional details and impacted files
@@ Coverage Diff @@
## master #5926 +/- ##
==========================================
+ Coverage 67.92% 67.96% +0.03%
==========================================
Files 137 137
Lines 18673 18694 +21
Branches 3154 3155 +1
==========================================
+ Hits 12684 12705 +21
Misses 5324 5324
Partials 665 665
| Files with missing lines | Coverage Δ | |
|---|---|---|
| beets/dbcore/__init__.py | 100.00% <100.00%> (ø) | |
| beets/dbcore/db.py | 94.71% <100.00%> (+0.19%) | :arrow_up: |
| beets/library/models.py | 87.20% <100.00%> (+0.02%) | :arrow_up: |
I just stumbled upon this function in the same module:
def _make_attribute_table(self, flex_table: str):
"""Create a table and associated index for flexible attributes
for the given entity (if they don't exist).
"""
with self.transaction() as tx:
tx.script(
"""
CREATE TABLE IF NOT EXISTS {0} (
id INTEGER PRIMARY KEY,
entity_id INTEGER,
key TEXT,
value TEXT,
UNIQUE(entity_id, key) ON CONFLICT REPLACE);
CREATE INDEX IF NOT EXISTS {0}_by_entity
ON {0} (entity_id);
""".format(flex_table)
)
I didn't realise we've already been creating indices for attribute tables!
As you mentioned, consistency is important, and it seems we now use two methods to create indices. Given the simplicity of the command above, I'm now questioning whether we really need the entire complexity of checking which indices already exist in the database. Can't we get away with the following instead?
diff --git a/beets/dbcore/db.py b/beets/dbcore/db.py
index 2e0df3fec..a54dabd7f 100755
--- a/beets/dbcore/db.py
+++ b/beets/dbcore/db.py
@@ -1159,10 +1159,11 @@ def load_extension
# Schema setup and migration.
- def _make_table(self, table: str, fields: Mapping[str, types.Type]):
+ def _make_table(self, model_cls: type[Model]):
"""Set up the schema of the database. `fields` is a mapping
from field names to `Type`s. Columns are added if necessary.
"""
+ table, fields = model_cls._table, model_cls._fields
# Get current schema.
with self.transaction() as tx:
rows = tx.query("PRAGMA table_info(%s)" % table)
@@ -1192,6 +1193,14 @@ def _make_table
table, name, typ.sql
)
+ create_idx_cmds = [
+ f"CREATE INDEX IF NOT EXISTS {i.name} ON {table} ({', '.join(i.columns)});"
+ for i in model_cls._indices
+ ]
+
+ setup_sql += "\n".join(create_idx_cmds)
+
with self.transaction() as tx:
tx.script(setup_sql)
I didn't realise we've already been creating indices for attribute tables!
I noticed that while creating the indices. To me, the whole set of attribute tables seems a bit inconsistent in the first place: they feel like a bolt-on rather than a first-class part of the schema, more like a key/value store glued on top of the ORM than something that plays nicely with the rest of the core tables. Long term, I think they’d probably benefit from a bigger refactor to align with the rest of the core layer. That’s also why I originally ignored it :upside_down_face:
From what I understand, the core was originally meant to function like a lightweight ORM, with an eye toward adding more tables and features to beets in the future. If we only care about functionality, we could simplify things quite a bit in that layer. Personally, though, I’m fine with a bit of added complexity here since it fits with the original ORM idea.
The advantage of this approach is that it gives us a path to evolve the schema more cleanly, e.g. adding or migrating indices in a structured way (sketched below), which isn’t really possible with your proposal: once created, indices stay around unless we add additional logic. For me this comes down to whether we want to stick with the ORM-like abstraction in the core or move toward a leaner, purely functional approach.
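To make the trade-off concrete, here is a rough sketch of what such structured index migration could look like, reusing the hypothetical `Index` shape from above. This is an assumption about the PR's `_migrate_indices`, not its actual code, though the PRAGMA pattern matches the debug output quoted further down.

def _migrate_indices(self, table: str, indices: "list[Index]"):
    """Create declared indices; drop and recreate any whose column
    list no longer matches its declaration (sketch).
    """
    with self.transaction() as tx:
        # Map existing index names to their ordered column lists.
        existing = {}
        for row in tx.query(f"PRAGMA index_list({table})"):
            name = row[1]
            info = tx.query(f"PRAGMA index_info({name})")
            existing[name] = [r[2] for r in info]

        for index in indices:
            current = existing.get(index.name)
            if current == list(index.columns):
                continue  # Unchanged: nothing to do.
            if current is not None:
                tx.script(f"DROP INDEX {index.name};")  # Definition changed.
            tx.script(index.create_sql(table))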
I don’t have a huge preference either way. I do plan to work on some schema migration in the future, which might benefit from the ORM-style complexity in general, but you’re right that it’s not urgent right now. How do you see this in the bigger picture?
See this blog post for more context regarding flexible attributes: https://beets.io/blog/flexattr.html. Certain implementation details can potentially be optimised, but the core concept is going to stay, I think.
How do you see this in the bigger picture?
I don't think I see how this relates to the bigger picture. I'm aware of the requirement:
we need to index items.album_id to optimise the lookup performance.
And I want to see the simplest solution that satisfies this requirement but doesn't create any tech debt and is easy to maintain.
See this blog post for more context regarding flexible attributes: https://beets.io/blog/flexattr.html. Certain implementation details can potentially be optimised, but the core concept is going to stay, I think.
I don’t disagree with having attribute tables in general, I think you got that wrong: they’re useful as a key–value store. My concern is just that they shouldn’t be treated differently from “normal” tables in the core. A table is a table, regardless of its content. Especially now that we’ve also introduced indices on the items table, the differences are even smaller. Treating them more uniformly might allow us to simplify some logic quite a bit.
I don't think I see how this relates to the bigger picture. I'm aware of the requirement:
I’m just trying to explore improvements here. For example, we could reuse the index data object on the attribute tables (and similarly in the migration function), or apply the same, or a similar, approach for field migrations in the future. That said, this doesn’t really fit into this PR, so I’m fine taking it one step at a time. If you prefer not to change existing code, that’s fine, but I like to maintain a codebase that I feel confident and happy working with. Adding features is much easier when the groundwork is set properly.
And I want to see the simplest solution that satisfies this requirement but doesn't create any tech debt and is easy to maintain
If the absolute main goal is just to optimize items.album_id lookup performance, then sure, we could hardcode the migration. However, this would reduce flexibility, make testing and reuse harder, and increase maintenance effort, essentially creating technical debt over time. There’s definitely a trade-off here.
I'm largely fine with the changes but since this affects one of the key pieces of the code base I think we'd benefit from hearing what @wisp3rwind thinks about it as well.
@snejus @wisp3rwind
I would like to revive this and merge it if we still think the changes are fine. I had another look and still like the approach. I also ran a micro benchmark against using tx.script directly (without the Index class), and there does not seem to be any difference.
------------------------------------------------------------------------------------------ benchmark: 2 tests -----------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS (Kops/s) Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_index_raw_benchmark 29.5750 (1.0) 117.1070 (1.10) 31.6931 (1.00) 5.5332 (1.05) 30.4920 (1.0) 0.4530 (1.06) 913;2520 31.5526 (1.00) 22167 1
test_index_recreate_benchmark 29.6570 (1.00) 106.1600 (1.0) 31.5888 (1.0) 5.2643 (1.0) 30.5230 (1.00) 0.4260 (1.0) 397;1197 31.6568 (1.0) 11565 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
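A hypothetical reconstruction of what this micro benchmark might look like with pytest-benchmark; only the test names come from the output above, while the bodies (in-memory SQLite, the `Index` sketch from earlier) are assumptions.

import sqlite3


def _fresh_items_db():
    # Minimal stand-in for a beets library database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, album_id INTEGER)"
    )
    return conn


def test_index_raw_benchmark(benchmark):
    # Baseline: hand-written CREATE INDEX statement via executescript.
    conn = _fresh_items_db()
    sql = "CREATE INDEX IF NOT EXISTS idx_item_album_id ON items (album_id);"
    benchmark(conn.executescript, sql)


def test_index_recreate_benchmark(benchmark):
    # Same statement rendered through the hypothetical Index sketch above.
    conn = _fresh_items_db()
    sql = Index("idx_item_album_id", ("album_id",)).create_sql("items")
    benchmark(conn.executescript, sql)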
I tested this locally and discovered a performance issue with the current implementation. It queries the database separately for each existing index on the table (N+1 problem):
$ BEETS_DEBUG=1 beet --plugins= version
sql
└── PRAGMA table_info(items)
sql
└──
CREATE TABLE IF NOT EXISTS item_attributes
(id INTEGER
PRIMARY KEY,
entity_id INTEGER, key TEXT, value TEXT, UNIQUE(entity_id, key) ON CONFLICT REPLACE);
CREATE INDEX IF NOT EXISTS item_attributes_by_entity ON item_attributes (entity_id);
sql
└── PRAGMA index_list(items)
sql
└── PRAGMA index_info(idx_item_album_id)
sql
└── PRAGMA index_info(items_album_id_idx)
sql
└── PRAGMA index_info(items_by_album)
sql
└── PRAGMA table_info(albums)
sql
└──
CREATE TABLE IF NOT EXISTS album_attributes
(id INTEGER
PRIMARY KEY,
entity_id INTEGER, key TEXT, value TEXT, UNIQUE(entity_id, key) ON CONFLICT REPLACE);
CREATE INDEX IF NOT EXISTS album_attributes_by_entity ON album_attributes (entity_id);
sql
└── PRAGMA index_list(albums)
sql
└── SELECT items.*
FROM (items)
WHERE id = 0
GROUP BY items.id
sql
└── SELECT *
FROM item_attributes
WHERE entity_id IN
(SELECT id
FROM
(SELECT items.*
FROM (items)
WHERE id = 0
GROUP BY items.id))
beets version 2.5.1
Python version 3.10.18
no plugins loaded
Notice how it executes PRAGMA index_list(items) followed by a separate PRAGMA index_info(...) for each existing index. This overhead happens on every beets command invocation.
I apologize for missing this during review - the implementation details weren't immediately obvious. This is a good reminder why I advocate for keeping things simple when our needs are straightforward. Given that we're adding a single index, could we use the simpler approach that avoids these extra queries? SQLite's CREATE INDEX IF NOT EXISTS handles the existence check efficiently at the engine level:
diff --git a/beets/dbcore/db.py b/beets/dbcore/db.py
index 0a3477371..db3430008 100755
--- a/beets/dbcore/db.py
+++ b/beets/dbcore/db.py
@@ -1088,9 +1088,9 @@ def __init__
# Set up database schema.
for model_cls in self._models:
- self._make_table(model_cls._table, model_cls._fields)
+ self._make_table(model_cls)
self._make_attribute_table(model_cls._flex_table)
- self._migrate_indices(model_cls._table, model_cls._indices)
+ # self._migrate_indices(model_cls._table, model_cls._indices)
# Primitive access control: connections and transactions.
@@ -1217,21 +1217,16 @@ def load_extension
conn.load_extension(path)
# Schema setup and migration.
-
- def _make_table(self, table: str, fields: Mapping[str, types.Type]):
+ def _make_table(self, model_cls: type[Model]):
"""Set up the schema of the database. `fields` is a mapping
from field names to `Type`s. Columns are added if necessary.
"""
+ table, fields = model_cls._table, model_cls._fields
# Get current schema.
with self.transaction() as tx:
rows = tx.query(f"PRAGMA table_info({table})")
current_fields = {row[1] for row in rows}
- field_names = set(fields.keys())
- if current_fields.issuperset(field_names):
- # Table exists and has all the required columns.
- return
-
if not current_fields:
# No table exists.
columns = []
@@ -1249,8 +1244,16 @@ def _make_table
f"ALTER TABLE {table} ADD COLUMN {name} {typ.sql};\n"
)
- with self.transaction() as tx:
- tx.script(setup_sql)
+ create_idx_cmds = [
+ f"CREATE INDEX IF NOT EXISTS {i.name} ON {table} ({', '.join(i.columns)});"
+ for i in model_cls._indices
+ ]
+
+ setup_sql += "\n".join(create_idx_cmds)
+
+ if setup_sql:
+ with self.transaction() as tx:
+ tx.script(setup_sql)
def _make_attribute_table(self, flex_table: str):
"""Create a table and associated index for flexible attributes
This eliminates the index-related PRAGMA queries entirely:
$ BEETS_DEBUG=1 beet --plugins= version
sql
└── PRAGMA table_info(items)
sql
└── CREATE INDEX IF NOT EXISTS idx_item_album_id ON items (album_id);
sql
└──
CREATE TABLE IF NOT EXISTS item_attributes
(id INTEGER
PRIMARY KEY,
entity_id INTEGER, key TEXT, value TEXT, UNIQUE(entity_id, key) ON CONFLICT REPLACE);
CREATE INDEX IF NOT EXISTS item_attributes_by_entity ON item_attributes (entity_id);
sql
└── PRAGMA table_info(albums)
sql
└──
CREATE TABLE IF NOT EXISTS album_attributes
(id INTEGER
PRIMARY KEY,
entity_id INTEGER, key TEXT, value TEXT, UNIQUE(entity_id, key) ON CONFLICT REPLACE);
CREATE INDEX IF NOT EXISTS album_attributes_by_entity ON album_attributes (entity_id);
sql
└── SELECT items.*
FROM (items)
WHERE id = 0
GROUP BY items.id
sql
└── SELECT *
FROM item_attributes
WHERE entity_id IN
(SELECT id
FROM
(SELECT items.*
FROM (items)
WHERE id = 0
GROUP BY items.id))
beets version 2.5.1
Python version 3.10.18
no plugins loaded
This approach also maintains consistency with how we already handle indices for the attribute tables. We can always revisit the abstraction layer if we need more sophisticated migration capabilities in the future, but for now this solves the immediate performance need without the overhead.
I was well aware of this and it was intended! It is a migration system by design: we need to fetch the existing indices to check whether they changed and need to be recreated. This is the same approach we take for table migration, so it is consistent with the current implementation.
The PRAGMA index_list and index_info queries have essentially no overhead, as they just retrieve schema metadata. On my machine the overhead is negligible (~2us per table).
Also, the number of indices is limited, so it is not as much of a scaling issue as you make it out to be here...
I appreciate the clarification on the intent. A few thoughts:
On the migration system design:
I understand this mirrors the table migration approach, and I can see the code does drop and recreate indices when they differ from what's defined. However, there's a key difference in how we'll actually use this: column definitions change over time (new fields get added, types evolve), but the indices we're defining have unique, stable names. Once idx_item_album_id is created, it won't change - we're not going to redefine it with different columns or rename it.
So while the migration capability exists, we're unlikely to use it in practice. Indices aren't like columns where we need to detect and handle schema evolution - we define them once with unique names and they stay that way. This means we're paying the query cost (PRAGMA index_list + N × PRAGMA index_info) on every beets invocation for migration detection we won't actually need.
On the performance impact:
You're right that individual PRAGMA calls are fast. My concern isn't about absolute numbers but about principle: we're doing work (N+1 queries) that CREATE INDEX IF NOT EXISTS does internally in one step. It's not about catastrophic slowdowns - it's about avoiding unnecessary work when SQLite already solves this for us.
On consistency:
I actually see this as an opportunity to improve consistency. The flexible attribute tables already use CREATE INDEX IF NOT EXISTS directly. Rather than making the main tables more complex to match the column migration pattern, we could align with the simpler pattern we already use successfully.
My ask:
Given that:
- Indices have unique names and won't need migration once created
- `CREATE INDEX IF NOT EXISTS` is simpler and used elsewhere in the codebase
- The immediate goal is adding one index to improve query performance
Could we start with the simpler approach and add migration infrastructure if/when we have concrete migration requirements? I'm genuinely trying to avoid premature complexity, not dismiss the value of your work.
Could we start with the simpler approach and add migration infrastructure if/when we have concrete migration requirements? I'm genuinely trying to avoid premature complexity, not dismiss the value of your work.
This is a long-term goal of mine for beets. I want to unify naming in the database and properly normalize it, which will also require changes to the indices. There are also some valid alternative approaches to a migration system, but I will need one in some form!
My concern is mostly about workflow and reviewability. If I drop a massive overhaul PR with thousands of changed lines in the core database module, nobody will realistically be able to review it. I’m trying hard to avoid that situation by breaking things into small, comprehensible steps. To do that, I need a way to land incremental additions even if the features are not fully used yet; otherwise the only alternative is a huge “all-or-nothing” patch that people won’t want to look at. We need a way to make these additions without them being dismissed as premature complexity...