Recommendations app causes slow queries saturating mysql
We're seeing a lot of sloooow queries due to the recommendations app. So many it actually brought the server to a halt.
They all have a similar pattern but differ in lenght.
Here's one example from mysqldumpslow. As you can see the query is taking ~ 25 min. on average.
Count: 33 Time=1602.62s (52886s) Lock=0.01s (0s) Rows_sent=7.0 (231), Rows_examined=7505485.0 (247681005),
query from mysqldump slow
SELECT `file`.`fileid`, `storage`, `path`, `path_hash`, `file`.`parent`, `name`, `mimetype`, `mimepart`, `size`, `mtime`, `storage_mtime`, `encrypted`, `etag`, `permissions`, `checksum`, `metadata_etag`, `creation_time`, `upload_time`
FROM `oc_filecache` `file`
LEFT JOIN `oc_filecache_extended` `fe` ON `file`.`fileid` = `fe`.`fileid`
WHERE ((`mimetype` <> N) OR (`size` = N))
AND (
(
(`storage` = N) AND ((`path` = 'S') OR (`path` LIKE 'S'))
)
OR
(
((`storage` = N) AND ((`path_hash` = 'S') OR (`path` LIKE 'S')))
AND
((`path_hash` = 'S') OR (`path` LIKE 'S'))
)
OR (
...
)
)
ORDER BY `mtime` + 'S' desc
LIMIT N
The actual values for the order and limit query are:
ORDER BY `mtime` + '0' desc LIMIT 7
This looks very much like the call to getRecent in
https://github.com/nextcloud/recommendations/blob/0342f4c6370e2396e6b54520cc2e27be245c74ab/lib/Service/RecentlyEditedFilesSource.php#L47-L71
Given that $max is 7 by default.
SHOW PROCESSLIST shows them as Creating sort index most of the time.
So maybe all we are missing is an index on mtime + '0'. Or maybe we should do away with the + '0'.
If you open /apps/files/?dir=/&view=recent you should see the same heavy query because both views use the same Files API to get the most recent files.
@ChristophWurst thanks! That might make it easier to trigger this and test fixes.
Disabled recommendations app for now and server is responsive again. I guess when everyone logs in at the same time in the morning and the dashboards trigger some long running queries we're running out of CPUs. Visiting most recent files is not as common i guess.
Oh... and also... there were quite a few group folders mentioned in the sloow queries. Might be just because they contain a lot of files or so - but maybe there's some additional complexity there.
Explain with the given query says it uses the fs_storage_mimetype index. I tried adding a fs_storage_mimetype_mtime index but it would not pick that up.
Looping in @CarlSchwan since the keyword "recent files" triggered a hit for your work during the performance hackweek or at least I link it to you :)
The issue is that it's using the search API that is really heavy on the DB level. @icewind1991 improved it quite a bit during the last hackweek so that the number of queries generated decreased by a lot.
Unfortunately, the big DB query that does the actual search across all the mountpoints of an user (share, groupfolders, ...) and generating this slow query is still here :( I discussed it a bit with Robin during the hack week and we don't see a way to make it lighter unless we make it less powerful e.g. not searching for mountpoints but only the user storage.
This is the same issue with the photo app that is currently timing out in perf testing due to this search.
@ChristophWurst besides the emojis what is your view on a "lighter" search query with respect to the recommendation app's result-quality? Would this be a viable option (your emoji suggests otherwise, so just asking explicitly)
This is the same issue with the photo app that is currently timing out in perf testing due to this search.
@CarlSchwan this implies we will have to come up with a fix anyways because even without the recommendations app we would then kill an instance with the new photos which is to be expected to be used by magenta at some point. So who would we need to gather to tackle this?
It looks like the sort is taking most of the time. I wonder if limiting the search to say files that changed within the last two weeks would help. We sort by mtime and limit by 7. So most of the time limiting to the last two weeks will reduce the number of files to be sorted a lot while without an impact on the result.
I don't know if the database knows to limit the sort to the entries that can actually make it into the top 7.
update: looks like we can find this out by setting some parameters for slow query logs: https://mariadb.com/kb/en/filesort-with-small-limit-optimization/
Implemented @max-nextcloud's idea in https://github.com/nextcloud/server/pull/33788
Unit tests are failing though, probably because the mtime in the test cases are not correctly configured. I will need to look into
@ChristophWurst besides the emojis what is your view on a "lighter" search query with respect to the recommendation app's result-quality? Would this be a viable option (your emoji suggests otherwise, so just asking explicitly)
AFAIK recommend files aren't as useful as we originally thought, so if we do a lighter way and exclude some data it would probably not harm user experience but rather enhance it loading the files becomes faster. As in, rather have 5 not so relevant files in 5s than slightly more relevant files in a timeout.
PR: Just to make the situation clear: this app is a consumer of the server API and has no direct influence on how the server generates the list of recent files. This problem is also found in other places, like the Recent page in the Files app.
From what i heard the +0 was added to prevent the mtime index from being used. Tests on a productions instance look like the + 0 actually does more harm then good here. (>60s instead of 2.5 sec)
We are limiting the results in the recommendation app to the 7 most recent. So i think the two approaches compare like this:
- take all files on the instance ordered by mtime. Go through them one by one and check if the user has access until you found the 7 most recently changed ones with access
- take all files the user has access to and sort them by mtime. Once you have sorted all of them ... take the top 7.
If a user has access to a lot of files sorting all of them might well outweight the cost of going through the most recent files and only picking the ones with access.
I don't quite understand why we cannot have the best of both worlds. Can't we do away with the + 0 and have an fs_storage_mimetype_mtime index (right now DESCRIBE says we are using fs_storage_mimetype)
edit: (Note that the mtime needs to be last on the index as it's used in sort / comparison WHERE clause)
Actually looking at the index used my comparison is wrong...
Looks like right now we use the index to exclude non empty folders... Those typically less than 50% of the nodes.
So... do we consider nextcloud/server#33788 the final word on this performance matter for now? Or does this need stay open to track something further?
I'm happy to close this. If things are still slow new investigations will be needed anyway.