[WIP] Virtual fields

Open brimoor opened this issue 2 years ago • 10 comments

Resolves https://github.com/voxel51/fiftyone/issues/2186

This PR adds support for defining virtual fields on datasets and views.

Unlike ordinary fields, virtual field values are not stored in the database; they are dynamically computed and attached when the collection is iterated/aggregated/etc.

Virtual fields are defined by simply including an expr when declaring them:

# Declare a virtual field that counts the number of objects in the `ground_truth` field
dataset.add_sample_field(
    "num_objects",
    fo.IntField,
    expr=F("ground_truth.detections").length(),
)

Virtual fields may also be added to views via the set_field() view stage by including the type of the field as an additional parameter:

# Add a virtual `num_objects` field to the view
view = dataset.set_field(
    "num_objects",
    F("ground_truth.detections").length(),
    ftype=fo.IntField,
)

Virtual view fields exist only on the view (and any subsequent views derived from it); they are not added to the underlying dataset.

Note that the ftype information is required so that the new virtual field can be added to the view's schema. When ftype is omitted, set_field() can still be used to modify existing fields in-place or to populate dynamic embedded fields without explicitly adding them to the dataset's schema.

Virtual fields on datasets

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")
dataset.compute_metadata()

dataset.add_sample_field(
    "num_objects",
    fo.IntField,
    expr=F("ground_truth.detections").length(),
)

bbox_area = (
    F("$metadata.width") * F("bounding_box")[2] *
    F("$metadata.height") * F("bounding_box")[3]
)

dataset.add_sample_field(
    "ground_truth.detections.area_pixels",
    fo.FloatField,
    expr=bbox_area,
)

Virtual fields are included in the dataset's schema:

assert "num_objects" in dataset.get_field_schema()
assert "ground_truth.detections.area_pixels" in dataset.get_field_schema(flat=True)

print(list(dataset.get_virtual_field_schema()))
# ['num_objects', 'ground_truth.detections.area_pixels']

Virtual fields are read-only on samples:

sample = dataset.first()
print(sample.num_objects)  # 3

sample.num_objects = 4
# ValueError: Virtual fields cannot be edited

Virtual fields are applied after views:

print(dataset.bounds("num_objects"))
# (0, 39)

print(dataset.bounds("ground_truth.detections.area_pixels"))
# (6.371400000000001, 353569.2344)

view = dataset.filter_labels("ground_truth", F("label") == "carrot")

print(view.bounds("num_objects"))
# (1, 13)

print(view.bounds("ground_truth.detections.area_pixels"))
# (155.3496, 118266.5256)

Use materialize() to inject virtual fields earlier so they can be filtered on:

view = (
    dataset
    .filter_labels("ground_truth", F("label") == "person")
    .materialize("num_objects")
    .match(F("num_objects") > 10)
    .sort_by(F("num_objects"), reverse=True)
)

print(view.values("num_objects"))
# [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 13, 13, 13, 11]
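
The effect of stage ordering can be illustrated without FiftyOne. The following is a toy sketch under stated assumptions, not the implementation in this PR: samples are plain dicts, and `run_pipeline`, the stage tuples, and the `virtual_fields` mapping are hypothetical names invented for the example. Virtual fields are attached at the end unless a "materialize" stage computes them earlier.

```python
# Toy model (not FiftyOne code): each sample is a dict with a list of labels
samples = [
    {"id": 1, "labels": ["person", "person", "dog"]},
    {"id": 2, "labels": ["person"]},
    {"id": 3, "labels": ["cat", "dog"]},
]

# Hypothetical virtual field definitions: name -> function of a sample
virtual_fields = {"num_objects": lambda s: len(s["labels"])}


def run_pipeline(samples, stages, virtual_fields):
    """Applies stages in order; unmaterialized virtual fields are attached last."""
    out = [dict(s) for s in samples]
    materialized = set()
    for kind, arg in stages:
        if kind == "filter_labels":
            # Keep only labels equal to `arg`, then drop samples left empty
            out = [
                dict(s, labels=[label for label in s["labels"] if label == arg])
                for s in out
            ]
            out = [s for s in out if s["labels"]]
        elif kind == "materialize":
            # Compute the virtual field now so that later stages can use it
            out = [dict(s, **{arg: virtual_fields[arg](s)}) for s in out]
            materialized.add(arg)
        elif kind == "match":
            # `arg` is a predicate over the sample dict
            out = [s for s in out if arg(s)]
        else:
            raise ValueError(f"Unknown stage: {kind}")

    # Any virtual field that was never materialized is computed at the end
    for name, fn in virtual_fields.items():
        if name not in materialized:
            out = [dict(s, **{name: fn(s)}) for s in out]

    return out


stages = [
    ("filter_labels", "person"),
    ("materialize", "num_objects"),
    ("match", lambda s: s["num_objects"] > 1),
]
result = run_pipeline(samples, stages, virtual_fields)
print([(s["id"], s["num_objects"]) for s in result])  # [(1, 2)]
```

In this toy model, dropping the "materialize" stage makes the "match" predicate fail with a `KeyError`, because `num_objects` does not yet exist when the predicate runs.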

Virtual fields can be selected/excluded from schemas as usual:

fo.pprint(dataset._pipeline())
# Virtual fields are automatically attached here

view = dataset.exclude_fields("num_objects")
assert "num_objects" not in view.get_field_schema()

fo.pprint(view._pipeline())
# `num_objects` is no longer attached

view = dataset.exclude_fields("ground_truth")
assert "ground_truth" not in view.get_field_schema()

fo.pprint(view._pipeline())
# `ground_truth.detections.area_pixels` is no longer attached

view = dataset.exclude_fields("ground_truth.detections.area_pixels")
assert "ground_truth.detections.area_pixels" not in view.get_field_schema(flat=True)

fo.pprint(view._pipeline())
# `ground_truth.detections.area_pixels` is no longer attached

Virtual fields can be "materialized" into regular fields via save():

dataset.save(fields="num_objects")
# Skipping virtual field 'num_objects' when materialize=False

assert "num_objects" in dataset.get_virtual_field_schema()
assert dataset.get_field("num_objects").is_virtual == True

# Converts a virtual field into a regular field
dataset.save(fields="num_objects", materialize=True)

assert "num_objects" not in dataset.get_virtual_field_schema()
assert dataset.get_field("num_objects").is_virtual == False

# Converts all virtual fields into regular fields
dataset.save(materialize=True)

assert len(dataset.get_virtual_field_schema()) == 0
assert dataset.get_field("ground_truth.detections.area_pixels").is_virtual == False

# Fields are no longer virtual; the values are now stored in the database
print(dataset._pipeline())  # []
print(dataset.bounds("num_objects"))  # (0, 39)
print(dataset.bounds("ground_truth.detections.area_pixels"))  # (6.371400000000001, 353569.2344)

Virtual fields on views

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

view = dataset.set_field(
    "num_objects",
    F("ground_truth.detections").length(),
    ftype=fo.IntField,
)

assert "num_objects" not in dataset.get_field_schema()
assert "num_objects" in view.get_field_schema()

print(view.bounds("num_objects"))
# (0, 39)

sample = view.first()
assert sample.num_objects == 3

sample.num_objects = 4
# ValueError: Virtual fields cannot be edited

brimoor avatar Dec 13 '22 15:12 brimoor

Very cool. Thinking out loud:

  • Not sure if virtual fields on datasets are needed with saved views combined with set_field
  • In an ideal world, I always imagined transforming your schema would just work:
box = F("bounding_box")
view = dataset.set_field("ground_truth.detections.area", box[2] * box[3])

assert isinstance(view.get_field_schema(flat=True)["ground_truth.detections.area"], fo.FloatField)
  • Are these not fields from a view, and therefore could be named View Fields? View feels like the accepted terminology here (beyond FiftyOne, DBs in general)

benjaminpkane avatar Dec 14 '22 15:12 benjaminpkane

I would love to make:

view = dataset.set_field("ground_truth.detections.area", box[2] * box[3])

directly add a new virtual field to the schema, but the trouble is that our ViewExpression methods aren't typed, so we have no way to impute from the provided expression what Field type to use for the virtual field. For now it seems we have to rely on the user to tell us via ftype.
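
To make the limitation concrete: inferring a field type would require every expression node to carry a result type. The sketch below is purely hypothetical; `TypedExpr` and its promotion rule are illustrative names and are not part of FiftyOne's `ViewExpression` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedExpr:
    """Hypothetical expression node that tracks the type of its result."""

    text: str
    result_type: type

    def __mul__(self, other):
        # Promote to float if either operand is a float, otherwise keep int
        if float in (self.result_type, other.result_type):
            result_type = float
        else:
            result_type = int
        return TypedExpr(f"({self.text} * {other.text})", result_type)


# Relative bounding box width and height are floats in [0, 1]
width = TypedExpr("bounding_box[2]", float)
height = TypedExpr("bounding_box[3]", float)

area = width * height
print(area.text, area.result_type.__name__)
# (bounding_box[2] * bounding_box[3]) float
```

With type information like this, a float result could be mapped to `fo.FloatField` automatically; without it, the caller has to pass `ftype`.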

brimoor avatar Dec 14 '22 15:12 brimoor

I would love to make:

view = dataset.set_field("ground_truth.detections.area", box[2] * box[3])

directly add a new virtual field to the schema, but the trouble is that our ViewExpression methods aren't typed, so we have no way to impute from the provided expression what Field type to use for the virtual field. For now it seems we have to rely on the user to tell us via ftype.

Yes, I'm just proposing that we put that on the roadmap and work towards it

benjaminpkane avatar Dec 14 '22 16:12 benjaminpkane

Codecov Report

Patch and project coverage have no change.

Comparison is base (4ffccbc) 15.98% compared to head (c865c2b) 15.98%. Report is 10 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #2429   +/-   ##
========================================
  Coverage    15.98%   15.98%           
========================================
  Files          572      572           
  Lines        71235    71235           
  Branches       800      800           
========================================
  Hits         11388    11388           
  Misses       59847    59847           
Flag Coverage Δ
app 15.98% <ø> (ø)

codecov[bot] avatar Dec 15 '22 20:12 codecov[bot]

@benjaminpkane any chance you could help me get this PR across the finish line? I believe there are just two small App tweaks that need to happen:

In the grid view:

  • If a user filters by a FIELD that is virtual in the sidebar, a Materialize(FIELD) stage must be injected prior to the filter

In the modal:

  • If the sample contains virtual fields, the virtual field's value must be recomputed every time the user applies filters in the sidebar (equivalently, the entire sample could be reloaded if that is preferable from an implementation standpoint)

It would also be cool perhaps to add an icon next to virtual fields in the App to indicate to the user that they are, in fact, virtual 🤔

brimoor avatar Aug 25 '23 02:08 brimoor

If a user filters by a FIELD that is virtual in the sidebar, a Materialize(FIELD) stage must be injected prior to the filter

The first item (above) is straightforward and can be worked on. On the second point:

If the sample contains virtual fields, the virtual field's value must be recomputed every time the user applies filters in the sidebar (equivalently, the entire sample could be reloaded if that is preferable from an implementation standpoint)

It seems like a fair amount of complexity is underneath this. Filtering on ground_truth labels could change a sample-level virtual field which could create a disjoint set of results on some virtual label attribute. The loading states would be tricky, but I'm more concerned about breaking the invariant that filtering strictly results in a subset of values, e.g. slider bounds could become larger because of filtering on some other field.
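
A small stand-alone illustration of this concern, using plain Python rather than FiftyOne (the sample data and the helper `num_objects_bounds` are made up for this sketch): when a count-style virtual field is recomputed after label filtering, the new bounds need not lie inside the pre-filter bounds.

```python
# Toy data: each sample is a list of ground truth labels
samples = [
    ["person", "person", "person", "dog"],
    ["person", "dog", "dog"],
]


def num_objects_bounds(samples):
    """Returns (min, max) of the per-sample object counts."""
    counts = [len(labels) for labels in samples]
    return min(counts), max(counts)


before = num_objects_bounds(samples)

# Keep only "person" labels, mimicking a sidebar filter on `ground_truth`
filtered = [[label for label in labels if label == "person"] for labels in samples]
after = num_objects_bounds(filtered)

print(before)  # (3, 4)
print(after)  # (1, 3)

# The post-filter minimum is below the pre-filter minimum, so the new
# range is not contained in the old one
print(after[0] < before[0])  # True
```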

benjaminpkane avatar Aug 28 '23 22:08 benjaminpkane

I think all virtual fields should probably be materialized before extending the view with sidebar filters. This would mean modal values would not update. If not, the order in which sidebar filtering stages are added can change results

benjaminpkane avatar Aug 28 '23 23:08 benjaminpkane

@benjaminpkane hmm, my mental model is the following:

  • If no filters involve virtual fields, then they are statically computed at the very end
  • Whenever a filter does involve a virtual field, a materialize() stage is injected prior to the filter. For all subsequent stages in the view, that virtual field behaves as a static field
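
The two rules above can be sketched as a small helper. This is a hypothetical illustration only: `build_stages` and the `("materialize", ...)`, `("filter", ...)`, and `("attach", ...)` tuples are invented for the example and do not correspond to actual App or SDK code.

```python
def build_stages(filters, virtual_fields):
    """Builds an ordered stage list from sidebar filters.

    A ("materialize", field) stage is injected before the first filter that
    uses a virtual field; afterwards that field is treated as a static field.
    """
    stages = []
    materialized = set()
    for field in filters:
        if field in virtual_fields and field not in materialized:
            stages.append(("materialize", field))
            materialized.add(field)

        stages.append(("filter", field))

    # Virtual fields that no filter touched are computed at the very end
    for field in virtual_fields:
        if field not in materialized:
            stages.append(("attach", field))

    return stages


virtual_fields = ["num_objects", "has_objects"]
stages = build_stages(["ground_truth", "num_objects"], virtual_fields)
for stage in stages:
    print(stage)
# ('filter', 'ground_truth')
# ('materialize', 'num_objects')
# ('filter', 'num_objects')
# ('attach', 'has_objects')
```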

Materializing all virtual fields before extending the view would reduce the usability of the feature in many cases. For example, the num_objects field in the example dataset above lets me apply some sidebar filters and then look at the num_objects tag in the grid to see how many objects matched the filter.

This is a subtle feature though; I may be missing something. If you have any concrete problem cases that would help!

brimoor avatar Aug 30 '23 14:08 brimoor

I definitely see the value in this feature. I just want to make sure it's clear what the App should do if we add these dependency graphs for sample fields.

A basic decision point that feels unclear shows up in the num_objects example.

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")
dataset.compute_metadata()

dataset.add_sample_field(
    "num_objects",
    fo.IntField,
    expr=F("ground_truth.detections").length(),
)

If a user first filters with the num_objects slider in the sidebar, and then filters any ground_truth field, the natural thing to do is to apply the ground_truth filtering first. So this means the extended view needs to know that num_objects is a virtual field that should materialize after ground_truth. In this case, the slider values need not change, but the data being filtered would change.

But if the virtual field was instead has_objects (a contrived string field example) and every sample had detections before filtering, then the filter would only materialize yes as a possible value. Once filtering occurs, no values could materialize which means the App may have to re-query values for virtual fields in the sidebar as other filters change to remain consistent with sample data.

dataset.add_sample_field(
    "has_objects",
    fo.StringField,
    expr=(F("ground_truth.detections").length() > 0).if_else("yes", "no"),
)
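
The has_objects scenario can be mimicked in plain Python. This is a toy sketch, not FiftyOne code: `has_objects_values` is a made-up helper, and it assumes that samples are kept even when all of their labels are filtered out.

```python
# Toy data: every sample has at least one label before filtering
samples = [
    ["person", "dog"],
    ["cat"],
    ["dog", "dog"],
]


def has_objects_values(samples):
    """Distinct values of a yes/no field derived from the label count."""
    return sorted({"yes" if labels else "no" for labels in samples})


print(has_objects_values(samples))  # ['yes']

# Mimic filtering labels without dropping samples that end up empty
filtered = [[label for label in labels if label == "person"] for labels in samples]
print(has_objects_values(filtered))  # ['no', 'yes']
```

The set of available values grows after filtering, which is why the sidebar would need to re-query the values of virtual fields whenever other filters change.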

benjaminpkane avatar Sep 05 '23 00:09 benjaminpkane

Walkthrough

The changes in this pull request introduce enhancements to the management of fields within datasets in the FiftyOne framework. Key updates include the addition of virtual fields, which are dynamically computed from existing fields, and improvements to the handling of sample and frame fields. The documentation has been updated to clarify these functionalities, and several methods related to field management have been added or modified across multiple files.

Changes

Files | Change Summary
docs/source/user_guide/using_datasets.rst, docs/source/user_guide/using_views.rst | Enhanced documentation on managing fields and virtual fields, including examples and methods for adding fields and utilizing virtual fields in views.
fiftyone/__public__.py | Introduced a new public entity, Materialize, expanding module functionality.
fiftyone/core/*.py | Multiple enhancements related to virtual fields, including new methods for managing virtual fields, updates to existing methods to include virtual parameters, and improved field validation.
tests/unittests/virtual_tests.py | Added comprehensive unit tests for virtual fields, covering creation, manipulation, and validation of virtual fields in datasets and views.

Assessment against linked issues

Objective Addressed Explanation
Add dynamic fields based on view contents (2186)
Improve usability of dynamic fields for real-time updates based on view filters (2186)

Possibly related PRs

  • #4787: Involves modifications to the fiftyone/core/dataset.py file, focusing on deletion operations, which do not directly relate to the enhancements in managing fields and virtual fields described in the main PR.

Poem

🐰 In fields of data, bright and new,
Virtual wonders come into view.
With every hop and every glance,
Dynamic fields now dance and prance!
So let us cheer, with joy and glee,
For FiftyOne's magic, wild and free! 🌼


coderabbitai[bot] avatar Sep 17 '24 14:09 coderabbitai[bot]