beets icon indicating copy to clipboard operation
beets copied to clipboard

Use SQL to query flex fields, and related Album/Item data

Open snejus opened this issue 2 years ago • 9 comments

Description

This PR depends on #4741 and #4744 which convert some queries to be SQL-compatible.

In short, it enables filtering flexible attributes and related items or album through SQL. This speeds up flexible attributes lookup and enables queries like

beet list album_flex:something  # list all items under the albums that have album_flex:something
beet list -a item_flex:something  # list all albums which have at least one item with item_flex:something

This is largely the same logic that is behind #4362. I realized that the original PR ended up too complicated with too much of not-so-strictly-required refactorings which makes it hard to review. This PR skips the noise and contains only the bits that are required.


Note: I will be adding basic tests for the new behaviour within the next days. It'd be good to hear what sort of expectations people have here - happy to test them too!

Right before submitting the PR I stumbled upon a limitation: a query with more than one flexible attribute filter gives empty results.

  • Completely forgot to take this into account - the way items and, say item_attributes are currently joined yields a row for each items and item_attributes combination, something like (simplified)

    items.id item_attributes.key item_attributes.value
    1 skip_count 10
    1 play_count 5

    When we try to filter using a single flexible attribute, one of these rows will match (if found, of course) so it works as expected.

    If we try to filter by two, we obviously fail. This can be tackled by (1) aggregating the attribtues or (2) performing EXISTS check within a subquery. Will test whichever is more performant and will add it in the next commit.

To Do

  • [ ] Documentation. (If you've add a new command-line flag, for example, find the appropriate page under docs/ to describe it.)
  • [ ] Changelog. (Add an entry to docs/changelog.rst near the top of the document.)
  • [x] Tests. (Encouraged but not strictly required.)

snejus avatar Apr 08 '23 18:04 snejus

Neato!! This is very exciting; I'm really glad we (as a project) are finally getting around to getting serious about making flexattr queries fast. When we built flexattr support long ago, I was pretty happy with the "slow query" mechanism as a way to make functionality progress without worrying about performance, but I always knew we'd want to come back and make them actually work in SQL through some combination of query cleverness and schema updates. We're doing that at last!

Thanks to your blog post which gave some nice pointers about what needs improving here! I do completely agree about what does not need improving here :)

Here's one somewhat low-level question: could we do this with a nested select instead of two SQL queries? SQLite supports expressions like item.id IN (SELECT ...). I haven't used this "subquery" functionality much so I don't know if it fits the bill, but it would be cool if we could avoid fetching all the IDs, stringifying them, and then feeding them back into the second-stage query syntax.

It may be possible - this would also offer an additional performance gain. I'll have a look how does it play together with having to query two flexible attribute tables! I guess those can be UNIONed. :)

snejus avatar Apr 10 '23 10:04 snejus

@sampsyo see the new commits: aggregation ended up significantly simplifying the machinery behind the new queries - this now uses just one query.

We can now query multiple flex attributes!

The drawback is that we're losing the ability to query related flexible attributes - for example, the simplicity here does not allow beet list -a title:hello or beet list artpath:art_path. Though, I think for now this should be OK - we'd in any case need to have a wider discussion about how queries like beet list -a play_count:10 should work - do we here check play_count for every item, or do we sum/aggregate them?

There are some tests failing that have to do with the types - I'll have a deeper look at them later today/tomorrow.

snejus avatar Apr 11 '23 13:04 snejus

That is awesome!!!!

Regarding the issue of "related" queries, it seems to me like these run in opposite directions:

  • beet list -a title:hello: Querying albums based on item metadata. I actually would have guessed that this currently would not work, unless there was an album out there with a flexattr called title (which would be weird). But maybe I am misremembering what we support here?
  • beet list artpath:art_path: Querying items based on album metadata. This actually seems more common, and we added support for doing stuff like beet list albumartist:Beatles in #2988. It would be great to avoid breaking this, although https://github.com/beetbox/beets/discussions/4742#discussioncomment-5556090 also points toward a larger change that would obviate this kind of "item-to-album fallback."

sampsyo avatar Apr 18 '23 00:04 sampsyo

@sampsyo I apologize - I used completely wrong examples in my previous comment 🤦🏽‍♂️ see the clarification below

The drawback is that we're losing the ability to query related flexible attributes - for example, the simplicity here does not allow ~beet list -a title:hello~ beet list -a item_flex_attr:hello or ~beet list artpath:art_path~ beet list album_flex_attr:hello.

In summary,

  • ✅ related model field lookups are supported
    • beet list -a title:hello
    • beet list artpath:hello
  • ⛔ related flex attr lookups are not supported
    • beet list -a item_flex_attr:hello
    • beet list album_flex_attr:hello

snejus avatar Apr 18 '23 08:04 snejus

Just had a couple of runs comparing master to this branch with various queries, see the results below

image

  • Previously, the more filters were provided on the command line, the longer would the query take. Now, the behavior is the opposite - every argument either speeds up the query or keeps it the same.
  • You can see master ran -a title:a very quickly - that's because it didn't return anything.

The new queries have a more or less constant speed; the command speed is largely determined by the number of things that need printing. See the table below for the summary of timings and the numbers of matched results. You can see that albums query tends to be slightly quicker than items.

image

snejus avatar Apr 18 '23 10:04 snejus

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 17 '23 01:09 stale[bot]

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 17 '24 13:03 stale[bot]

@sampsyo would you still want this to be merged? I could try and rebase it off master

snejus avatar Mar 19 '24 03:03 snejus

Improving flexible field queries would be huge... right now they are painfully slow for a large library.

arsaboo avatar Mar 19 '24 11:03 arsaboo

Superseded by #5240

snejus avatar May 09 '24 11:05 snejus