Make queries fast, filter all flexible attributes
- Make LazyClassProperty / cached_classproperty reusable
- Add support for filtering relations
- Add ability to debug queries
- Fix querying fields present in both tables
- Aggregate flexible attributes
- Add ability to filter flexible attributes through the Query
- Enable querying related flexible attributes
- Remove slow lookups from beetsplug/aura
Description
Another, and hopefully final, attempt to improve querying speed.
Fixes #4360, #3515, and possibly more issues related to slow queries.
This PR supersedes #4746.
What's been done
The album and item tables are joined, and the corresponding data from `item_attributes` and `album_attributes` is merged and made available for filtering. This makes the following possible:
- [x] Faster album path queries: `beet list -a path::some/path`
- [x] Faster flexible attribute queries, for both albums and tracks: `beet list play_count:10`
- [x] (New) Ability to filter albums with track-level (and vice versa) db field queries: `beet list -a title:something`
- [x] (New) Ability to filter tracks with album-level flexible field queries: `beet list artpath:cover`
- [x] (New) Ability to filter albums with track-level flexible field queries: `beet list -a art_source:something`
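To make the new capabilities concrete, here is a minimal, hypothetical sketch of the kind of join that lets albums be filtered by a track-level flexible attribute. The table and column names mirror beets' schema but are simplified, and the SQL is illustrative rather than what beets actually generates:

```shell
# Hypothetical sketch (not beets' actual SQL): albums, items, and
# item_attributes joined so albums can be filtered by a track-level
# flexible attribute such as play_count. Schema is a simplified assumption.
db=$(mktemp)
sqlite3 "$db" <<'SQL'
CREATE TABLE albums (id INTEGER PRIMARY KEY, album TEXT);
CREATE TABLE items (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT);
CREATE TABLE item_attributes (entity_id INTEGER, key TEXT, value TEXT);
INSERT INTO albums VALUES (1, 'Played Often'), (2, 'Rarely Played');
INSERT INTO items VALUES (1, 1, 'hit'), (2, 2, 'b-side');
INSERT INTO item_attributes VALUES (1, 'play_count', '10'), (2, 'play_count', '1');
SQL
result=$(sqlite3 "$db" "
  SELECT DISTINCT albums.album
  FROM albums
  JOIN items ON items.album_id = albums.id
  JOIN item_attributes ON item_attributes.entity_id = items.id
  WHERE item_attributes.key = 'play_count'
    AND CAST(item_attributes.value AS INTEGER) >= 10;
")
echo "$result"
rm -f "$db"
```

The point of such a join is that filters on flexible attributes can run inside SQLite rather than being applied row by row in Python, which is what makes the timings largely independent of the query.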
Benchmarks
You can see that querying speed is now more or less constant regardless of the query; it is mostly determined by how many results need to be printed out.
Compare this with what we had previously.
To Do
- [ ] Documentation. (If you've added a new command-line flag, for example, find the appropriate page under `docs/` to describe it.)
- [ ] Changelog. (Add an entry to `docs/changelog.rst` near the top of the document.)
- [x] Tests. (Encouraged but not strictly required.)
Later
- [ ] Submit a PR with the corresponding adjustment for sorting and a fix for `lslimit`
- [ ] Submit a PR with the corresponding adjustment for template variable resolution
Thank you for the PR! The changelog has not been updated, so here is a friendly reminder to check if you need to add an entry.
I'm using the test-aura branch as the base since it depends on it getting merged. Though for now, I will change the base to master in order to run the tests.
I'll be able to review in a week or two, just end of semester push at the moment.
Hey, also just chiming in to say that it will take some time for me to go through the current batch of PRs. I won't be able to keep up the response times I had last week, but I'll slowly work through all of them.
I've left some comments on specific lines and things, but here are some general comments. I noticed that you changed `field` to `col_name` in the definition of `FieldQuery`, which is reflected in `bareasc`. Do you think that this will have an impact on compatibility with plugins more broadly? Going through the plugins with grep, I couldn't find any other instances, but it is something to keep in mind. You haven't altered any of the interfaces in our docs so it isn't too much of an issue, but it will definitely have to be mentioned in the changelog.
I think it's quite clever how you merged all of the flexible attributes and optimised the SQL queries! This is a great PR, will really improve how beets works.
Btw, with those later PRs to do, I'd open bugs/enhancements for them so we don't forget!
Also, those benchmarks are great. We should add them as a poetry command, I think; doing that in another PR would probably be best. What have you used to make those?
> I've left some comments on specific lines and things
Hm, I'm not able to see them! Did you submit the review?
...I did not! I have now.
> I noticed that you changed `field` to `col_name` in the definition of `FieldQuery` which is reflected in `bareasc`. Do you think that this will have an impact on the compatibility with plugins more broadly? Going through the plugins with grep, I couldn't find any other instances but it is something to keep in mind.
Well noted @Serene-Arc. I think I adjusted every internal plugin that used this functionality. Otherwise, I haven't seen any external plugins based on this interface.
> You haven't altered any of the interfaces in our docs so it isn't too much of an issue, but it will definitely have to be mentioned in the changelog.
You're right, this definitely needs a mention in the changelog!
> Also, those benchmarks are great. We should add them as a poetry command I think, in another PR would probably be best. What have you used to make those?
This one's a bit complicated. 😅 In short, I used:
- `hyperfine` to measure the performance between the two versions and save the results as JSON
- `jq` to handle the JSON data
- `rich-tables` to render it
- `zsh` as the glue between these parts
In detail
1. Prepare each executable for testing
- Install `beets` from the `master` branch as `beetmaster` (in the repository): `pipx install . --suffix master`
- Install `beets` from my branch as `beetfast`: `pipx install -e . --suffix fast`

Now one can run `beetmaster` and `beetfast` to test each version accordingly.
2. Prepare some queries for testing
I added the following contents to a file `./queries.txt`, one query per line:
```
artpath::a
artpath::a -a
path::a
path::a -a
art_source::a
art_source::a -a
title:a
title:a -a
play_count:5..
play_count:5.. -a
play_count:50..
play_count:50.. -a
art_source::a play_count::5 city:Be
art_source::a play_count::5 city:Be -a
art_source::. play_count:..5 city::.*
art_source::. play_count:..5 city::.* -a
```
3. Use hyperfine to measure the performance of the two versions
```zsh
compare_performance() {
    # args: <queries-file> <results-json-file>
    # info: read queries from <queries-file>, compare performance, and save
    #       the results to <results-json-file>
    local queries_file=$1
    local results_json_target=$2
    local -a queries
    queries=(${(f@)"$(grep -v "^#" $queries_file)"})

    # Compare performance of beetmaster and beetfast using hyperfine.
    # For each of the given queries, hyperfine runs each command 5 times
    # and exports the results to a JSON file.
    hyperfine --parameter-list query ${(j:,:)queries} --runs 5 --ignore-failure \
        --export-json $results_json_target \
        'beetmaster -l /home/sarunas/.music/beets/library.db list {query}' \
        'beetfast -l /home/sarunas/.music/beets/library.db list {query}'
}
```
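The zsh expansions above can look cryptic, so here is a rough bash equivalent (an assumed translation, not part of the original script): `${(f@)"$(...)"}` splits command output on newlines into an array, and `${(j:,:)queries}` joins the array elements with commas for hyperfine's `--parameter-list`:

```shell
# Rough bash equivalent of the zsh split/join idioms (assumed translation).
printf '%s\n' '# a comment' 'title:a' 'path::a' > /tmp/queries_demo.txt
# Split non-comment lines into an array (zsh: ${(f@)"$(grep ...)"}).
mapfile -t queries < <(grep -v '^#' /tmp/queries_demo.txt)
# Join the array with commas (zsh: ${(j:,:)queries}).
joined=$(IFS=,; printf '%s' "${queries[*]}")
echo "$joined"   # -> title:a,path::a
rm -f /tmp/queries_demo.txt
```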
4. Count how many items each command returns and insert it into the results
```zsh
add_results_count() {
    # args: <performance-results-json> <performance-with-results-count-json-target>
    # info: count the number of results each command returns for each query, add
    #       this to the performance results JSON, and save it in the given target file
    local performance_json=$1
    local target_json=$2
    local -a runs
    runs=(${(f@)"$(jq '.results[].command' < $performance_json -r)"})
    local run

    # save each command (which includes the executable and query) and the
    # number of results it returned
    rm -f results_count.txt
    for run in $runs; do
        print "$run $($run | wc -l)" >> results_count.txt
    done

    # parse the results into JSON objects
    jq '
      capture("(?<command>.*) (?<results>[^ ]+)$")
      | {(.command): (.results | tonumber)}
    ' -R <results_count.txt |
        jq add -s > results_count.json

    # add the results count to the performance results JSON
    jq '.results |= map(.results_count = $index[.command])' \
        --argjson index "$(<results_count.json)" \
        <$performance_json >$target_json
}
```
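The counting step above can be sketched in plain bash, using stand-in commands instead of real beet invocations: run each command, count its output lines, and remember a command-to-count mapping.

```shell
# Plain-bash sketch of the counting step (stand-in commands, assumed shapes).
declare -A results_count
commands=('seq 3' 'seq 5')   # stand-ins for 'beetmaster ... list <query>' etc.
for cmd in "${commands[@]}"; do
  # Run the command and count how many lines (results) it printed.
  results_count[$cmd]=$(eval "$cmd" | wc -l)
done
echo "${results_count['seq 5']}"   # -> 5
```

The original script persists this mapping as JSON via `jq` so it can be merged back into hyperfine's results file.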
5. Get relevant data from the results JSON and display it with `rich-tables`
```zsh
show_hyperfine_results() {
    # args: <results-json-file>
    # info: print hyperfine results nicely; pass -j to get raw JSON instead
    local -a json
    zparseopts -D -E j=json
    jq '
      def mv(before; after): .[after] = .[before] | del(.[before]);
      def rm(regex): gsub(regex; "");
      def round(num): . * pow(10; num) | round / pow(10; num);
      def group_map(field):
        (field | split(",") | map(rm("^ *| *$"))) as $key
        | sort_by(.[$key[]])
        | group_by(.[$key[]])
        | sort_by(length)
        | (map({([.[0][$key[]] | if type == "array" then join(", ") else . end | tostring] | join(" | ")): map(del(.[$key[]]))}) | add);
      def group_map_dict(field):
        (field | split(",")) as $key
        | sort_by(.[$key[]])
        | group_by(.[$key[]])
        | sort_by(length)
        | map((.[0] as $item | $key | map({(.): $item[.]}) | add) + {"items": map(del(.[$key[]]) | select(length > 0))});

      .results
      | map(
          del(.times, .stddev, .median, .system, .user, .exit_codes)
          | (.mean, .min, .max) |= round(2)
          | mv("mean"; "duration")
          | mv("results_count"; "results")
          | .command |= (rm("BEETSDIR= ") | rm(" .*"))
          | .parameters |= .query
          | if (.parameters | test(" -a")) then .type = "album" else .type = "item" end
          | {type, command, parameters, results, duration}
        )
      | group_map("type")
      # | map_values(
      #     group_map_dict("parameters")
      #   )
      # | sort_by(.items[0].mean - .items[1].mean)
    ' $1 | {
        if (( $#json )); then
            jq
        else
            table
        fi
    }
}
```
Steps 3-5 combined:
```zsh
benchmark_beets() {
    # info: benchmark beetmaster and beetfast end to end
    local queries_file=queries.txt
    local performance_json=master-vs-beetfast-counts.json
    local results_json=benchmark_with_counts.json
    compare_performance $queries_file $performance_json
    add_results_count $performance_json $results_json
    show_hyperfine_results $results_json
}
```
That is a lot of steps for a benchmark! It's useful, though, so I might open an enhancement issue for it, and we can explore additional methods and any simplifications to the process. It's a good thing to include, especially when things get this complex.
We could even include a GitHub action that tests it and gives a message if a PR increases average time by 50% or 25% or whatever we specify, just to give maintainers a hint to scrutinise it a little more closely.
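The threshold idea could start as something as simple as a script comparing two mean timings from hyperfine's JSON output. A hypothetical sketch (the numbers, variable names, and 25% threshold are stand-ins):

```shell
# Hypothetical regression gate: warn if the new mean time exceeds the old
# one by more than a threshold percentage (here 25%). Values are stand-ins.
old_mean=1.20   # seconds, e.g. extracted from the baseline hyperfine JSON
new_mean=1.65   # seconds, e.g. extracted from the PR's hyperfine JSON
threshold=25
# Integer percentage increase, truncated.
pct=$(awk -v o="$old_mean" -v n="$new_mean" 'BEGIN { printf "%d", (n - o) / o * 100 }')
if [ "$pct" -gt "$threshold" ]; then
  echo "warning: average time increased by ${pct}% (> ${threshold}%)"
fi
# prints: warning: average time increased by 37% (> 25%)
```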
@snejus I just updated after this PR, and it seems that everything is painfully slow. Even a simple `beet update` takes a significantly long time. Are we supposed to do anything else?
@arsaboo what kind of requests are taking longer?
`beet update` seems like the worst hit; even `beet import` seems slow. Let me know what additional information you would like.
Hmm, interesting. @Serene-Arc, do not release a new version just yet; I will investigate this over the coming days.
@arsaboo, can you provide the output of `beet stats` here?
Also, what are your system, Python, and SQLite versions?
I added `time` to the command... even `stats` takes time.
```
$ time beet stats
Tracks: 55012
Total time: 27.3 weeks
Approximate total size: 1.2 TiB
Artists: 16399
Albums: 11414
Album artists: 2361

real    0m27.519s
user    0m24.341s
sys     0m3.152s
```
Here's another output, from `beet import`, even after I specified the ID (I accepted the prompt immediately, and it still took over a minute). This used to be almost instantaneous:
```
$ time beet import -m -I -t ~/shared/music/ --set genre="Filmi" --search-id 0vGMpTlGXYZ
deezer: Deezer API error: no data
/home/arsaboo/shared/music/Jubin Nautiyal/REDACTED (2024) (1 items)
Match (88.2%):
Jubin Nautiyal - REDACTED
≠ album, tracks
Spotify, None, 2024, None, Tips Industries Ltd, None, None
https://open.spotify.com/album/0vGMpTlGXYZ
* Artist: Jubin Nautiyal
...
➜ [A]pply, More candidates, Skip, Use as-is, as Tracks, Group albums,
Enter search, enter Id, aBort, Print tracks, Open files with Picard,
eDit, edit Candidates?

real    1m15.144s
user    1m1.506s
sys     0m10.224s
```
```
$ sqlite3 --version
3.37.2 2022-01-06 13:25:41 872ba256cbf61d9290b571c0e6d82a20c224ca3ad82971edc46b29818d5dalt1
$ python3 --version
Python 3.10.12
```
I see. You've got an old SQLite version that doesn't have built-in JSON functionality. Could you try upgrading your SQLite to version 3.38.0 or higher?
You can see in the code that we define the required JSON functions ourselves for SQLite < 3.38.0; these are definitely going to be slower than the built-in ones. I didn't realise they were quite so slow, though.
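A quick way to probe whether a given SQLite build has the JSON functions available (this one-liner is just an illustrative check; builds without JSON support will error out here instead of printing a value):

```shell
# Probe for SQLite's JSON functions (built in by default since 3.38.0;
# older distro builds may or may not include the JSON1 extension).
result=$(sqlite3 :memory: "SELECT json_extract('{\"play_count\": 10}', '$.play_count');")
echo "$result"   # -> 10
```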
@arsaboo can you also confirm you're using Ubuntu?
yes, I am on Ubuntu
Shouldn't sqlite be automatically updated?
I am reverting this change; look for an incoming PR.
See https://github.com/beetbox/beets/pull/5326
> Shouldn't sqlite be automatically updated?
Unfortunately Ubuntu does not keep your packages up to date :(. In any case, even with an up-to-date SQLite you will see that these commands take about twice as long to execute - thus the revert, at least for now!
How unfortunate! This PR was very well done. It does highlight that we should have benchmarking tests in place, though, at least where these types of PRs are concerned, and perhaps a wider range of machines on CI for benchmarking, such as specific Ubuntu releases and so on, rather than just the latest.