fiftyone icon indicating copy to clipboard operation
fiftyone copied to clipboard

Smarter deletion of temp datasets

Open kaixi-wang opened this issue 1 year ago • 5 comments

Changes:

  • add a flag for not deleting datasets referenced in saved views
  • don't add deleted datasets to singletons list

Todo:

  • add tests

kaixi-wang avatar Oct 31 '23 00:10 kaixi-wang

Codecov Report

Attention: 21 lines in your changes are missing coverage. Please review.

Comparison is base (1ebf3f0) 16.17% compared to head (f623b3a) 16.17%. Report is 5 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3743   +/-   ##
========================================
  Coverage    16.17%   16.17%           
========================================
  Files          641      641           
  Lines        74137    74137           
  Branches       982      982           
========================================
  Hits         11988    11988           
  Misses       62149    62149           
Flag Coverage Δ
app 16.17% <55.31%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...perators/src/OperatorInvocationRequestExecutor.tsx 0.00% <ø> (ø)
...ackages/relay/src/fragments/mediaFieldsFragment.ts 100.00% <100.00%> (ø)
app/packages/relay/src/graphQLSyncFragmentAtom.ts 52.17% <100.00%> (ø)
app/packages/state/src/recoil/index.ts 100.00% <100.00%> (ø)
app/packages/state/src/recoil/selectors.ts 50.98% <ø> (ø)
app/packages/state/src/recoil/mediaFields.ts 83.33% <83.33%> (ø)
app/packages/operators/src/built-in-operators.ts 0.00% <0.00%> (ø)
app/packages/state/src/recoil/schema.ts 39.97% <42.10%> (ø)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 31 '23 00:10 codecov[bot]

Tracking whether a collection is in-use or not is a relevant problem that connects with work that @swheaton is doing on versioning. But the relevant definition of "in use" is whether a user is literally using the dataset at that moment, not whether a saved view that's not currently being used stores a reference to a dataset.

brimoor avatar Oct 31 '23 01:10 brimoor

Is there a specific problem you're trying to solve? There should be another way to solve it.

Views like ToPatches() that use temporary datasets are designed to automatically regenerate their backing datasets if they're ever deleted:

https://github.com/voxel51/fiftyone/blob/440704307d8a480eefa99d102939fa576e5c99e5/fiftyone/core/stages.py#L7453

In fact it is preferable that these temporary datasets do get deleted; if the old collection still exists, then loading a saved patches view would just use it, even though the root dataset may have changed significantly since the view was saved. The patches view would then be out of sync, which is counter to the goal of dataset views: they're supposed to be a virtual concept that's always in sync (yes, creating temporary collections is an expensive operation; but I didn't have a better idea at the time to implement these views).

The way to force a patches view to "resync" with the underlying dataset is to call view.reload(), which, case and point, deletes the temporary dataset, because this is a safe thing to do:

https://github.com/voxel51/fiftyone/blob/440704307d8a480eefa99d102939fa576e5c99e5/fiftyone/core/patches.py#L335

regenerating the view is expensive and can take time, so when loading saved views in the app, if the temp dataset doesn't already exist, the app will throw an error. Also, because build view is called multiple times when a single view is loaded, the app will try to create the same temp dataset, leading to a duplicate index error.

I think a better approach is to update the temp dataset when the root dataset is updated rather than just randomly delete and completely regenerate views, especially if the datasets are large.

kaixi-wang avatar Oct 31 '23 16:10 kaixi-wang

Regenerating the temp dataset only when the root dataset is updated would have ideal properties from a reuse standpoint, but we don't have a way to track Dataset.last_updated_at just yet. It does connect with work that @swheaton is doing on Sample.last_updated_at though. Any efforts to reuse, retain, or otherwise change the behavior of temporary dataset cleanup should really be done in conjunction with dataset versioning, as they both use "hidden datasets" and have similar cleanup needs.

If loading a saved view in the App builds the view multiple times, I can see how that could lead to temporary patch datasets getting generated multiple times concurrently, which is wasteful from a computation standpoint. Sounds like we should try to remove duplicate view load calls, for efficiency but also just for cleanliness.

Can you provide a reproducible example of seeing a duplicate index error? Can't guess how/why that would be happening.

brimoor avatar Oct 31 '23 19:10 brimoor

If loading a saved view in the App builds the view multiple times,

@kaixi-wang can you share a video on how to confirm we are loading a view multiple times when saved view loads? I just checked the network tab and I only see a set_view call once. maybe you mean in the SDK side?

manivoxel51 avatar Nov 01 '23 13:11 manivoxel51