[WIP] Virtual fields
Resolves https://github.com/voxel51/fiftyone/issues/2186
This PR adds support for defining virtual fields on datasets and views.
Unlike ordinary fields, virtual field values are not stored in the database; they are dynamically computed and attached when the collection is iterated/aggregated/etc.
Virtual fields are defined by simply including an expr when declaring them:
# Declare a virtual field that counts the number of objects in the `ground_truth` field
dataset.add_sample_field(
"num_objects",
fo.IntField,
expr=F("ground_truth.detections").length(),
)
Virtual fields may also be added to views via the set_field() view stage by including the type of the field as an additional parameter:
# Add a virtual `num_objects` field to the view
view = dataset.set_field(
"num_objects",
F("ground_truth.detections").length(),
ftype=fo.IntField,
)
Virtual view fields exist only on the view (and subsequent views derived from them); they are not added to the underlying dataset.
Note that the ftype information is required so that the new virtual field can be added to the view's schema. When ftype is omitted, set_field() can still be used to modify existing fields in-place or to populate dynamic embedded fields without explicitly adding them to the dataset's schema.
Virtual fields on datasets
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F
dataset = foz.load_zoo_dataset("quickstart")
dataset.compute_metadata()
dataset.add_sample_field(
"num_objects",
fo.IntField,
expr=F("ground_truth.detections").length(),
)
bbox_area = (
F("$metadata.width") * F("bounding_box")[2] *
F("$metadata.height") * F("bounding_box")[3]
)
dataset.add_sample_field(
"ground_truth.detections.area_pixels",
fo.FloatField,
expr=bbox_area,
)
Virtual fields are included in the dataset's schema:
assert "num_objects" in dataset.get_field_schema()
assert "ground_truth.detections.area_pixels" in dataset.get_field_schema(flat=True)
print(list(dataset.get_virtual_field_schema()))
# ['num_objects', 'ground_truth.detections.area_pixels']
Virtual fields are read-only on samples:
sample = dataset.first()
print(sample.num_objects) # 3
sample.num_objects = 4
# ValueError: Virtual fields cannot be edited
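The read-only behavior above can be imitated in plain Python. This is only a sketch of the idea: the `Sample` class below is a hypothetical stand-in, not FiftyOne's `Sample`, and it does not reflect how the PR implements the check.

```python
class Sample:
    """Toy sample (not FiftyOne's Sample class) with one virtual attribute."""

    # Names that are computed on read and therefore cannot be assigned
    _virtual = {"num_objects"}

    def __init__(self, labels):
        self.labels = labels

    @property
    def num_objects(self):
        # Computed from `labels` every time it is read; never stored
        return len(self.labels)

    def __setattr__(self, name, value):
        if name in self._virtual:
            raise ValueError("Virtual fields cannot be edited")
        super().__setattr__(name, value)


sample = Sample(["cat", "dog", "bird"])
print(sample.num_objects)  # 3

try:
    sample.num_objects = 4
    error = None
except ValueError as e:
    error = str(e)

print(error)  # Virtual fields cannot be edited
```

Because the value is derived on every read, changing `sample.labels` changes `sample.num_objects` without any explicit update.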
Virtual fields are applied after views:
print(dataset.bounds("num_objects"))
# (0, 39)
print(dataset.bounds("ground_truth.detections.area_pixels"))
# (6.371400000000001, 353569.2344)
view = dataset.filter_labels("ground_truth", F("label") == "carrot")
print(view.bounds("num_objects"))
# (1, 13)
print(view.bounds("ground_truth.detections.area_pixels"))
# (155.3496, 118266.5256)
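Why the bounds shrink on the filtered view can be illustrated without FiftyOne. The sketch below models samples as plain dicts; `filter_labels`, `num_objects`, and `bounds` are hypothetical helpers that mimic the ordering (filter first, compute the virtual value last), not the library's implementation.

```python
# Toy model (not FiftyOne): each sample is a dict holding a list of labels
samples = [
    {"labels": ["carrot", "person", "carrot"]},
    {"labels": ["person"]},
    {"labels": ["carrot"]},
]


def filter_labels(samples, label):
    """Keeps only the matching labels and drops samples left with none."""
    filtered = [{"labels": [l for l in s["labels"] if l == label]} for s in samples]
    return [s for s in filtered if s["labels"]]


def num_objects(sample):
    """Stand-in for the virtual field: computed on read, never stored."""
    return len(sample["labels"])


def bounds(samples):
    counts = [num_objects(s) for s in samples]
    return (min(counts), max(counts))


dataset_bounds = bounds(samples)
view_bounds = bounds(filter_labels(samples, "carrot"))

print(dataset_bounds)  # (1, 3)
print(view_bounds)  # (1, 2)
```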
Use materialize() to inject virtual fields earlier so they can be filtered on:
view = (
dataset
.filter_labels("ground_truth", F("label") == "person")
.materialize("num_objects")
.match(F("num_objects") > 10)
.sort_by(F("num_objects"), reverse=True)
)
print(view.values("num_objects"))
# [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 13, 13, 13, 11]
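The effect of where the value is frozen in the pipeline can be sketched in plain Python. Everything here is a toy model under stated assumptions: `materialize`, `filter_labels`, and `match` are hypothetical functions over dicts that only mimic the stage ordering, not the view stages added by this PR.

```python
# Toy model (not FiftyOne): a view is a list of dict samples
samples = [
    {"labels": ["person"] * 3 + ["dog"] * 2},
    {"labels": ["person"] * 1 + ["dog"] * 4},
    {"labels": ["person"] * 2},
]


def filter_labels(samples, label):
    """Keeps only the matching labels and drops samples left with none."""
    out = []
    for s in samples:
        kept = [l for l in s["labels"] if l == label]
        if kept:
            out.append({**s, "labels": kept})
    return out


def materialize(samples):
    """Freezes the computed count into a stored value at this point."""
    return [{**s, "num_objects": len(s["labels"])} for s in samples]


def match(samples, min_count):
    """Filters on the frozen value, which later stages treat as static."""
    return [s for s in samples if s["num_objects"] > min_count]


# Counts if the value were frozen before any filtering
unfiltered_counts = [s["num_objects"] for s in materialize(samples)]

# Filter, then freeze the count, then filter on the frozen count
view = match(materialize(filter_labels(samples, "person")), 1)
values = sorted((s["num_objects"] for s in view), reverse=True)

print(unfiltered_counts)  # [5, 5, 2]
print(values)  # [3, 2]
```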
Virtual fields can be selected/excluded from schemas as usual:
fo.pprint(dataset._pipeline())
# Virtual fields are automatically attached here
view = dataset.exclude_fields("num_objects")
assert "num_objects" not in view.get_field_schema()
fo.pprint(view._pipeline())
# `num_objects` is no longer attached
view = dataset.exclude_fields("ground_truth")
assert "ground_truth" not in view.get_field_schema()
fo.pprint(view._pipeline())
# `ground_truth.detections.area_pixels` is no longer attached
view = dataset.exclude_fields("ground_truth.detections.area_pixels")
assert "ground_truth.detections.area_pixels" not in view.get_field_schema(flat=True)
fo.pprint(view._pipeline())
# `ground_truth.detections.area_pixels` is no longer attached
Virtual fields can be "materialized" into regular fields via save():
dataset.save(fields="num_objects")
# Skipping virtual field 'num_objects' when materialize=False
assert "num_objects" in dataset.get_virtual_field_schema()
assert dataset.get_field("num_objects").is_virtual == True
# Converts a virtual field into a regular field
dataset.save(fields="num_objects", materialize=True)
assert "num_objects" not in dataset.get_virtual_field_schema()
assert dataset.get_field("num_objects").is_virtual == False
# Converts all virtual fields into regular fields
dataset.save(materialize=True)
assert len(dataset.get_virtual_field_schema()) == 0
assert dataset.get_field("ground_truth.detections.area_pixels").is_virtual == False
# Fields are no longer virtual; the values are now stored in the database
print(dataset._pipeline()) # []
print(dataset.bounds("num_objects")) # (0, 39)
print(dataset.bounds("ground_truth.detections.area_pixels")) # (6.371400000000001, 353569.2344)
Virtual fields on views
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F
dataset = foz.load_zoo_dataset("quickstart")
view = dataset.set_field(
"num_objects",
F("ground_truth.detections").length(),
ftype=fo.IntField,
)
assert "num_objects" not in dataset.get_field_schema()
assert "num_objects" in view.get_field_schema()
print(view.bounds("num_objects"))
# (0, 39)
sample = view.first()
assert sample.num_objects == 3
sample.num_objects = 4
# ValueError: Virtual fields cannot be edited
Very cool. Thinking out loud:
- Not sure if virtual fields on datasets are needed with saved views combined with set_field
- In an ideal world, I always imagined transforming your schema would just work:
box = F("bounding_box")
view = dataset.set_field("ground_truth.detections.area", box[2] * box[3])
assert isinstance(view.get_field_schema(flat=True)["ground_truth.detections.area"], fo.FloatField)
- Are these not fields from a view, and therefore could be named View Fields? View feels like the accepted terminology here (beyond FiftyOne, DBs in general)
I would love to make:
view = dataset.set_field("ground_truth.detections.area", box[2] * box[3])
directly add a new virtual field to the schema, but the trouble is that our ViewExpression methods aren't typed, so we have no way to impute from the provided expression what Field type to use for the virtual field. For now it seems we have to rely on the user to tell us via ftype.
Yes, I'm just proposing that we put that on the roadmap and work towards it
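For illustration only, here is what type inference over expressions could look like if expressions carried type information. The classes below (`Field`, `Mul`, `Length`) are entirely hypothetical and do not exist in FiftyOne; the sketch assumes the declared type of each leaf field is known and is not a proposal for the actual `ViewExpression` design.

```python
from dataclasses import dataclass

# Hypothetical mini expression language; none of these classes exist in FiftyOne


@dataclass(frozen=True)
class Field:
    name: str
    dtype: type  # declared type of the underlying field

    def infer_type(self):
        return self.dtype


@dataclass(frozen=True)
class Mul:
    left: object
    right: object

    def infer_type(self):
        # A product is float if either operand is float, otherwise int
        types = {self.left.infer_type(), self.right.infer_type()}
        return float if float in types else int


@dataclass(frozen=True)
class Length:
    arg: object

    def infer_type(self):
        # The length of a list is always an integer
        return int


width = Field("bounding_box[2]", float)
height = Field("bounding_box[3]", float)

area_type = Mul(width, height).infer_type()
count_type = Length(Field("ground_truth.detections", list)).infer_type()

print(area_type.__name__, count_type.__name__)  # float int
```

With inference along these lines, the result type could in principle be mapped to a field class, so the caller would not need to pass `ftype`.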
Codecov Report
Patch and project coverage have no change.
Comparison is base (4ffccbc) 15.98% compared to head (c865c2b) 15.98%. Report is 10 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #2429 +/- ##
========================================
Coverage 15.98% 15.98%
========================================
Files 572 572
Lines 71235 71235
Branches 800 800
========================================
Hits 11388 11388
Misses 59847 59847
| Flag | Coverage Δ | |
|---|---|---|
| app | 15.98% <ø> (ø) |
@benjaminpkane any chance you could help me get this PR across the finish line? I believe there are just two small App tweaks that need to happen:
In the grid view:
- If a user filters by a FIELD that is virtual in the sidebar, a Materialize(FIELD) stage must be injected prior to the filter
In the modal:
- If the sample contains virtual fields, the virtual field's value must be recomputed every time the user applies filters in the sidebar (equivalently, the entire sample could be reloaded if that is preferable from an implementation standpoint)
It would also be cool perhaps to add an icon next to virtual fields in the App to indicate to the user that they are, in fact, virtual 🤔
If a user filters by a FIELD that is virtual in the sidebar, a Materialize(FIELD) stage must be injected prior to the filter
The first item (above) is straightforward and can be worked on. On the second point:
If the sample contains virtual fields, the virtual field's value must be recomputed every time the user applies filters in the sidebar (equivalently, the entire sample could be reloaded if that is preferable from an implementation standpoint)
It seems like a fair amount of complexity is underneath this. Filtering on ground_truth labels could change a sample-level virtual field which could create a disjoint set of results on some virtual label attribute. The loading states would be tricky, but I'm more concerned about breaking the invariant that filtering strictly results in a subset of values, e.g. slider bounds could become larger because of filtering on some other field.
I think all virtual fields should probably be materialized before extending the view with sidebar filters. This would mean modal values would not update. If not, the order in which sidebar filtering stages are added can change results
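The subset invariant concern can be made concrete with a small plain-Python sketch. The `mean_area` field below is an assumed example that does not appear in this PR; a count such as `num_objects` can only shrink under label filtering, but an aggregate like a mean can grow.

```python
# Toy model (not FiftyOne): each sample holds the pixel areas of its detections
samples = [
    {"areas": [100, 10]},
    {"areas": [40, 40]},
]


def mean_area(sample):
    """Hypothetical sample-level virtual field derived from the labels."""
    return sum(sample["areas"]) / len(sample["areas"])


def filter_areas(samples, min_area):
    """Stand-in for a sidebar filter on a label attribute."""
    out = []
    for s in samples:
        kept = [a for a in s["areas"] if a > min_area]
        if kept:
            out.append({"areas": kept})
    return out


def bounds(samples):
    values = [mean_area(s) for s in samples]
    return (min(values), max(values))


before = bounds(samples)
after = bounds(filter_areas(samples, 20))

print(before)  # (40.0, 55.0)
print(after)  # (40.0, 100.0)
```

Filtering out the small box raises the first sample's mean from 55.0 to 100.0, so the upper bound of the virtual field grows even though the set of labels shrank.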
@benjaminpkane hmm, my mental model is the following:
- If no filters involve virtual fields, then they are statically computed at the very end
- Whenever a filter does involve a virtual field, a materialize() stage is injected prior to the filter. For all subsequent stages in the view, that virtual field behaves as a static field
Materializing all virtual fields before extending the view would reduce the usability of the feature in many cases. For example, the num_objects field in the example dataset above lets me apply some sidebar filters and then look at the num_objects tag in the grid to see how many objects matched the filter.
This is a subtle feature though; I may be missing something. If you have any concrete problem cases, that would help!
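The injection rule described in the two bullets above can be sketched as a small function. The stage representation here, a list of `(stage_name, field)` tuples, is a hypothetical simplification and not how the App or `fiftyone.core.stages` represents views.

```python
def inject_materialize(stages, virtual_fields):
    """Inserts a materialize stage before the first stage that uses each virtual field.

    `stages` is a list of hypothetical (stage_name, field) tuples.
    """
    materialized = set()
    out = []
    for name, field in stages:
        if field in virtual_fields and field not in materialized:
            # From here on, the field behaves like a static field
            out.append(("materialize", field))
            materialized.add(field)
        out.append((name, field))
    return out


sidebar_filters = [
    ("filter_labels", "ground_truth"),
    ("match", "num_objects"),
    ("sort_by", "num_objects"),
]

pipeline = inject_materialize(sidebar_filters, {"num_objects"})
for stage in pipeline:
    print(stage)

# If no stage references a virtual field, nothing is injected
untouched = inject_materialize([("filter_labels", "ground_truth")], {"num_objects"})
```

In this sketch the field is materialized only once, so the second stage that references `num_objects` sees the same frozen value as the first.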
I definitely see the value in this feature. I just want to make sure it's clear what the App should do if we add these dependency graphs for sample fields.
A basic decision point that feels unclear shows up in the num_objects example.
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F
dataset = foz.load_zoo_dataset("quickstart")
dataset.compute_metadata()
dataset.add_sample_field(
"num_objects",
fo.IntField,
expr=F("ground_truth.detections").length(),
)
If a user first filters with the num_objects slider in the sidebar, and then filters any ground_truth field, the natural thing to do is to apply the ground_truth filtering first. So this means the extended view needs to know that num_objects is a virtual field that should materialize after ground_truth. In this case, the slider values need not change, but the data being filtered would change.
But if the virtual field was instead has_objects (a contrived string field example) and every sample had detections before filtering, then the field would only materialize "yes" as a possible value. Once filtering occurs, "no" values could materialize, which means the App may have to re-query values for virtual fields in the sidebar as other filters change to remain consistent with sample data.
dataset.add_sample_field(
"has_objects",
fo.StringField,
expr=(F("ground_truth.detections").length() > 0).if_else("yes", "no"),
)
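The has_objects scenario can be reproduced in plain Python. This is a toy model that assumes the label filter keeps samples whose labels are all removed; `has_objects` and `filter_labels_keep_empty` are hypothetical helpers, not FiftyOne calls.

```python
# Toy model (not FiftyOne): every sample has at least one label before filtering
samples = [
    {"labels": ["person", "dog"]},
    {"labels": ["dog"]},
]


def has_objects(sample):
    """Stand-in for the contrived string-valued virtual field."""
    return "yes" if len(sample["labels"]) > 0 else "no"


def filter_labels_keep_empty(samples, label):
    """Filters labels but keeps samples whose label list becomes empty."""
    return [{"labels": [l for l in s["labels"] if l == label]} for s in samples]


before = sorted({has_objects(s) for s in samples})
after = sorted({has_objects(s) for s in filter_labels_keep_empty(samples, "person")})

print(before)  # ['yes']
print(after)  # ['no', 'yes']
```

The set of distinct values grows after filtering, which is why a sidebar that cached `['yes']` would be out of date.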
Walkthrough
The changes in this pull request introduce enhancements to the management of fields within datasets in the FiftyOne framework. Key updates include the addition of virtual fields, which are dynamically computed from existing fields, and improvements to the handling of sample and frame fields. The documentation has been updated to clarify these functionalities, and several methods related to field management have been added or modified across multiple files.
Changes
| Files | Change Summary |
|---|---|
| docs/source/user_guide/using_datasets.rst, docs/source/user_guide/using_views.rst | Enhanced documentation on managing fields and virtual fields, including examples and methods for adding fields and utilizing virtual fields in views. |
| fiftyone/__public__.py | Introduced a new public entity, Materialize, expanding module functionality. |
| fiftyone/core/*.py | Multiple enhancements related to virtual fields, including new methods for managing virtual fields, updates to existing methods to include virtual parameters, and improved field validation. |
| tests/unittests/virtual_tests.py | Added comprehensive unit tests for virtual fields, covering creation, manipulation, and validation of virtual fields in datasets and views. |
Assessment against linked issues
| Objective | Addressed | Explanation |
|---|---|---|
| Add dynamic fields based on view contents (2186) | ✅ | |
| Improve usability of dynamic fields for real-time updates based on view filters (2186) | ✅ | |
Possibly related PRs
- #4787: Involves modifications to the fiftyone/core/dataset.py file, focusing on deletion operations, which do not directly relate to the enhancements in managing fields and virtual fields described in the main PR.