fix: Filter for folders in cleanup old preview job
- Resolves: #35936
Summary
When running OC\Preview\BackgroundCleanupJob, the main iteration loop in run() expects a folder,
however, getOldPreviewLocations() currently does not filter by mimetype
and therefore can yield a non-folder entry which causes an Exception when constructing the Folder impl.
Filtering for httpd/unix-directory, as getNewPreviewLocations() already does, fixes this issue.
Checklist
- Code is properly formatted
- Sign-off message is added to all commits
- [x] Tests (unit, integration, api and/or acceptance) are included
- [ ] Screenshots before/after for front-end changes
- [ ] Documentation (manuals or wiki) has been updated or is not required
- [ ] Backports requested where applicable (ex: critical bugfixes)
Same as https://github.com/nextcloud/server/pull/46072?
Hello there, Thank you so much for taking the time and effort to create a pull request to our Nextcloud project.
We hope that the review process is going smooth and is helpful for you. We want to ensure your pull request is reviewed to your satisfaction. If you have a moment, our community management team would very much appreciate your feedback on your experience with this PR review process.
Your feedback is valuable to us as we continuously strive to improve our community developer experience. Please take a moment to complete our short survey by clicking on the following link: https://cloud.nextcloud.com/apps/forms/s/i9Ago4EQRZ7TWxjfmeEpPkf6
Thank you for contributing to Nextcloud and we hope to hear from you soon!
(If you believe you should not receive this message, you can add yourself to the blocklist.)
As the above mentioned PR seems to be staled, I rebased and refined this one.
It now also incorporates the second improvement made in 0c7eeec43f50632c19513697f5ceed056a6e8c69 for the job also ignoring non-numeric named directories.
However, 0c7eeec43f50632c19513697f5ceed056a6e8c69 did this inefficiently by needing to cast all filecache row IDs to a string, which then also results in needing to perform expensive string comparisons for the JOIN.
The same goal is still achieved by further filtering the small amount of entries for this appid of type directory to match against a trivial RegExp.
As an expression to match against a RegExp was not present yet, I implemented it for this PR.
Requesting a review @artonge @kesselb @solracsf
Additionally tagging @Altahrim, @blizzz and @nickvergessen for requesting a review.
Can you please explain why this PR is better than https://github.com/nextcloud/server/pull/46072 ?
Can you please explain why this PR is better than #46072 ?
@solracsf Sure thing:
- The other PR seems to be inactive.
- As I explained in https://github.com/nextcloud/server/pull/48581#issuecomment-2604447290, the changes in the other PR contain an majorly inefficient query change, which results in casting the numeric ID for all files/folders on the Nextcloud instance to a string to then be string-compared against a (to be expected) small list of files, fully bypassing any index, slowing down the periodic
BackgroundCleanupJob. - This PR contains tests for the incorporated changes.
Is the regexp part actually needed? What kind of directory were selected when they should not be?
@artonge The RegExp is only needed for the following reason:
When JOINing the folder names of the previews folder (representing the filecache IDs of the associated images) against the filecache table, we cast the folder name to an integer.
This is fine in normal circumstances, as an untampered, vanilla instance of Nextcloud only creates and stores folders with only numeric names in this directory.
However, what this (and the above mentioned PR) try to fix/implement is a fix to the issue where a user wants, for whatever reason, create arbitrary files/folders there, for example a CACHEDIR.TAG file usually done to exclude the contents of this directory from backups.
The first change already filters out any file, as the query filters by the httpd/unix-directory MIME-type.
However, #46072 also implemented a second change which would also support arbitrary, user-created folders inside this directory by addressing the case when a non-numeric folder name then is supposed to be casted to an integer, which might be undefined/inconsistent behavior for/across some DB engines.
However, as explained in https://github.com/nextcloud/server/pull/48581#issuecomment-2609146772, my approach prevents the costly cast of all file-IDs to a string by just filtering out all non-numeric folder names using the RegExp approach implemented in this PR, as there is otherwise no available better method to check if a SQL VARCHAR is an integer, especially across all (supported) DB engines.
I personally do not see the use-case for supporting arbitrary folders in this directory, which is why I initially did not implement this, however I wanted to reimplement the added support for this from the other PR, while not massively tanking its performance while doing so.
tl;dr: It is needed, if we also want to support arbitrary, user-created folders in the preview appdata directory, otherwise not, which would massively simplify this PR (as it was in its previous form).
@hammer065 Thank you for your work on this and explanations. What we were wondering when discussing this issue with @artonge is that the join on (int)name = fileid should already filter out non-numeric folder names. But there may be corner cases depending on DB servers, I remember that cast can behave quite weird in some situations. Like '1.3' could be cast to 1 and not be filtered out on sqlite I think.
But that’s quite a niche cornercase, and the code after should survive. Worst case scenario I think it will delete a folder in the preview folder which should not be there in the first place.
So all in all I would vote for the simpler version of the PR without the regex usage. Maybe we can keep the regex implementation the queryBuilder though, it may prove useful some day.
Thanks for the detailed summary @hammer065. On my side, I indeed don't see the point in introducing such logic for a hypothetical edge case.
my approach prevents the costly cast of all file-IDs to a string by just filtering out all non-numeric folder names using the RegExp
I would especially be against, as "all non-numeric folder names" would probably be only one.
Maybe we can keep the regex implementation the queryBuilder though, it may prove useful some day.
Not sure about the performance of such operator. Unless it is fast, having it in by default might lead to being it use without knowing better and causing performance issues. If we don't have a use case, let's not keep it.
Please let me know what the end position is after your (internal) discussions, so I can adapt the PR accordingly.
My personal stances are to both not support arbitrary folders, just files, in the previews directory and not to keep the RegEx code.
Please let me know what the end position is after your (internal) discussions, so I can adapt the PR accordingly.
My personal stances are to both not support arbitrary folders, just files, in the previews directory and not to keep the RegEx code.
We agree, please remove regex related code and keep the changes minimal, it will also be easier to backport.
We agree, please remove regex related code and keep the changes minimal, it will also be easier to backport.
@come-nc Done, you can review this now (finally) final commit :D
/backport to stable31
/backport to stable30
/backport to stable29
Thanks for your first pull request and welcome to the community! Feel free to keep them coming! If you are looking for issues to tackle then have a look at this selection: https://github.com/nextcloud/server/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22
@hammer065 Thanks a lot for your patience and team work here!
Would you be interested in a Guest account on our Nextcloud Talk instance where we host a dev chat?
Would you be interested in a Guest account on our Nextcloud Talk instance where we host a dev chat? @come-nc Sure thing, would definitely speed up the discussion process for any potential future PRs :D Can this be linked/Is this linkable to my account on my own Nextcloud instance or would it be a separate account all together?
Would you be interested in a Guest account on our Nextcloud Talk instance where we host a dev chat? @come-nc
Sure thing, would definitely speed up the discussion process for any potential future PRs :D Can this be linked/Is this linkable to my account on my own Nextcloud instance or would it be a separate account all together?
I invited you on the email used for the commit sign-off. It’s a guest account using the guests app, it’s not linkable to a federated account. I think we did have federation enabled for Talk on our instance, but I’m not sure whether it’s enabled for community chat yet.