
[Not merge] Force a scan before trying to cleanup chunks

Open jvillafanez opened this issue 2 years ago • 9 comments

Description

There might be cases where the upload data has disappeared from the DB, but the actual chunks are still present in the FS. At the moment, those chunks can't be cleaned up even if they're too old, so they have to be removed manually. This PR forces a scan of the upload folder in order to be able to delete those stray chunks.
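
For context, here is a minimal sketch of what "forcing a scan" could look like, assuming the command already resolves each user's upload folder as an OCP\Files\Folder node; the helper name and wiring below are illustrative and not the actual change in this PR:

```php
<?php
// Hedged sketch of the idea, not the actual diff of this PR: before the
// existing cleanup lists and expires old chunks, re-scan the upload folder
// so chunk directories that exist on disk but have no oc_filecache rows
// become visible to the cleanup code again.

use OCP\Files\Folder;

function forceUploadScan(Folder $uploadFolder): void {
    // getScanner() belongs to the internal storage API (available on the
    // concrete storage implementations, not necessarily on the public OCP
    // interface), and is the same machinery core's own scan jobs build on.
    $storage = $uploadFolder->getStorage();
    $scanner = $storage->getScanner();

    // Re-syncs the on-disk state of this folder (recursively by default)
    // into the file cache, re-creating the missing entries for stray chunks.
    $scanner->scan($uploadFolder->getInternalPath());
}
```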

Related Issue

https://github.com/owncloud/enterprise/issues/5125

Motivation and Context

The chunks can take up a lot of space, and there can be too many uploads happening to remove the old ones manually.

How Has This Been Tested?

Expected test case is:

  1. Have 2 servers behind a load balancer, each server with its own upload folder (not shared).
  2. Upload a file from the desktop client and abort the upload halfway through.
  3. Check that the chunks are present in the upload folder of one of the servers (only one, not both).
  4. Wait a couple of days.
  5. From the server without the chunks, run occ dav:cleanup-chunks. This will remove the info about the uploads from the DB. Note that the chunks will still be present in the FS of the other server.
  6. From the server with the chunks, run occ dav:cleanup-chunks -l. This will force the scan of the upload folder, bring the data back into the DB, and then clean up the chunks.

Actual test:

  1. Use only one server.
  2. Upload a file from the desktop client and abort the upload halfway through.
  3. Check that the chunks are present in the upload folder.
  4. Change the mtime of the folder containing the chunks, and of the chunks themselves, in the FS so they appear to have been uploaded at least 2 days ago. Do the same for the related DB entries in the oc_filecache table (storage_mtime and mtime columns). This simulates that more than 2 days have passed since the aborted upload (see the sketch after these steps).
  5. Remove the DB entries for the upload from the oc_filecache table, including the "uploads/" entry and the contents of that folder. This simulates the "wrong" occ dav:cleanup-chunks run on the wrong server.
  6. Run occ dav:cleanup-chunks -l.
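
For steps 4 and 5, here is a hedged sketch of how the oc_filecache rows could be backdated from a one-off script using the query builder; the path pattern, the two-day offset, and the helper name are assumptions about this particular test setup, not part of the PR:

```php
<?php
// Hedged helper for step 4 of the manual test above: pushes mtime and
// storage_mtime of the cached upload entries far enough into the past that
// the cleanup considers them old. The 'uploads/%' path pattern and the
// two-day offset are assumptions about this particular setup. Step 5 would
// be a similar ->delete('filecache') with the same WHERE clause.

use OCP\DB\QueryBuilder\IQueryBuilder;
use OCP\IDBConnection;

function backdateUploadEntries(IDBConnection $db, int $days = 2): int {
    $cutoff = \time() - $days * 24 * 3600;

    $qb = $db->getQueryBuilder();
    $qb->update('filecache')
        ->set('mtime', $qb->createNamedParameter($cutoff, IQueryBuilder::PARAM_INT))
        ->set('storage_mtime', $qb->createNamedParameter($cutoff, IQueryBuilder::PARAM_INT))
        ->where($qb->expr()->like(
            'path',
            $qb->createNamedParameter('uploads/%')
        ));

    // Returns the number of rows touched, useful to confirm the test setup.
    return $qb->execute();
}
```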

Screenshots (if appropriate):

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Database schema changes (next release will require increase of minor version instead of patch)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Technical debt
  • [ ] Tests only (no source changes)

Checklist:

  • [x] Code changes
  • [ ] Unit tests added
  • [ ] Acceptance tests added
  • [ ] Documentation ticket raised:
  • [ ] Changelog item, see TEMPLATE

jvillafanez avatar Apr 21 '22 08:04 jvillafanez

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

update-docs[bot] avatar Apr 21 '22 08:04 update-docs[bot]

Tests did not run in CI: https://drone.owncloud.com/owncloud/core/35352/12/3

### changed files ###

 apps/dav/lib/Command/CleanupChunks.php | 15 +++++++++++----

### check if disallowed file was changed ###

 - no file disallowed to be skipped was changed

The logic for deciding that source code has changed is wrong. I will make an issue for that and fix it...

phil-davis avatar Apr 21 '22 08:04 phil-davis

> The logic for deciding that source code has changed is wrong. I will make an issue for that and fix it...

Issue https://github.com/owncloud/core/issues/39992 is fixed in core master. Please rebase this PR and the correct test pipelines should all run.

phil-davis avatar Apr 21 '22 09:04 phil-davis

Spotted a potential issue with the solution: if the command runs with the -l option on the "wrong" server, it will remove the data from the DB. The expected setup could be that there is a cron job running occ dav:cleanup-chunks -l on all the servers. Without the PR, this works fine because the data in the DB isn't deleted. However, with the PR, the scan causes the data to be removed from the DB, which could interfere with the normal behavior of the rest of the servers.

I think it will be better to split the option. The new expectations are:

  • All servers will have a cron job running the occ dav:cleanup-chunks -l command. The behavior will stay the same as before.
  • If some data isn't removed, the admin will have to run the new (still to be implemented) occ dav:cleanup-chunks -l --forceScan command, which will perform the scan planned in this PR. This means the new --forceScan option will act as a recovery mechanism (see the sketch after this list).
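
A minimal sketch of how that split could look, assuming the command stays a regular Symfony Console command as in apps/dav/lib/Command/CleanupChunks.php; the option wiring follows the standard Console pattern, and scanUploadFolders() is a hypothetical stand-in for the scan described above:

```php
<?php
// Hedged sketch of the proposed split, not the actual diff: the forced scan
// only runs when the admin explicitly opts in with --forceScan, so the
// regular cron invocation of `occ dav:cleanup-chunks -l` keeps today's
// behaviour.

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;

class CleanupChunksSketch extends Command {
    protected function configure() {
        $this->setName('dav:cleanup-chunks')
            // The command's existing arguments and its -l option are left
            // out here; only the new opt-in flag is shown.
            ->addOption(
                'forceScan',
                null,
                InputOption::VALUE_NONE,
                'Rescan upload folders before cleanup (recovery mode; may interfere with uploads in progress)'
            );
    }

    protected function execute(InputInterface $input, OutputInterface $output) {
        if ($input->getOption('forceScan')) {
            // Only pay the cost (and risk) of the scan when explicitly asked.
            $this->scanUploadFolders($output);
        }

        // ... the existing expiration of old chunks would continue here ...
        return 0;
    }

    private function scanUploadFolders(OutputInterface $output) {
        // Placeholder for the per-user upload-folder scan sketched earlier.
        $output->writeln('<info>Scanning upload folders before cleanup ...</info>');
    }
}
```

With this split, the cron jobs keep running the plain -l command unchanged, and --forceScan is only used manually when stray chunks are detected.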

jvillafanez avatar Apr 21 '22 11:04 jvillafanez

❗ there might be side effects caused by the scan. The scan might break existing uploads happening at the same time.

jvillafanez avatar Apr 21 '22 12:04 jvillafanez

> ❗ there might be side effects caused by the scan. The scan might break existing uploads happening at the same time.

Could a TTL be added? Some time X that could be configurable?

cdamken avatar Nov 18 '22 12:11 cdamken

@jvillafanez can you summarise here what the exact flow of operations in your proposed fix is? Something like a flow chart? But please consider a distributed setup with e.g. 3 servers. It would be easier to say whether the design is good or not.

mrow4a avatar Nov 21 '22 10:11 mrow4a

I think the top post has all the information.

If I remember correctly, the problem is that somehow we have chunks in the FS but not in the DB. This is regardless of the number of servers we have, and it's the problem the PR tried to fix.

In order to reach that state (chunks in the FS but not in the DB), I assumed the steps to reproduce, particularly steps 1 to 5 of the "expected test case"; the last step is the one that should fix the issue with this PR. Basically, those steps are what I assume happened. The "actual test" is what I did to reach the same state in order to test the PR.

However, the current PR has some problems as explained in https://github.com/owncloud/core/pull/39990#issuecomment-1105066172

I think the main problem is that they had a lot of chunks that couldn't be deleted from the FS because they didn't have matching entries in the DB. However, it isn't clear how they ended up in that state. The steps to reproduce are an assumption of how they could have reached it, but it might be a wrong assumption.

jvillafanez avatar Nov 21 '22 13:11 jvillafanez

Can we close this? @mrow4a

IljaN avatar Oct 02 '23 08:10 IljaN