Revamped Hugging Face Hub Integration
First draft of the new Hugging Face Hub integration.
Attachment: Hugging Face Integration — FiftyOne 0.24.0 documentation.pdf
Overview
This integration introduces two functions: load_from_hub() and push_to_hub().
The major architectural changes from the previous version are as follows:
- Use configs: a `fiftyone.yml`, `fiftyone.yaml`, or custom local YAML config file (this can be extended to GitHub as well) specifies the format of the dataset to be converted into FiftyOne. This is in contrast to the previous version of the integration, which had a default converter that tried its best but was not guaranteed to succeed, and custom Python loading script execution, which we can still add back in at some point, but is not necessarily the lowest barrier to entry.
- Rather than using HF's `datasets` library, loading the dataset with their `load_dataset()` function, and then converting — which limited our flexibility, required massive downloads in many cases, and resulted in duplication of a ton of files — this version uses the HF Datasets Server API to directly request and load the relevant data without needing to go through the `datasets` Python library. This gives a few key additional advantages, as documented below.
- Config info can be specified exclusively via kwargs, removing the need to explicitly create a `fiftyone.yml` file locally.
- Supports loading datasets from gated repos
Additional improvements:
- Added logging during the loading & conversion process
- `description`, `license`, and `repo_id` from the HF config file are added to the FiftyOne dataset's `info` dictionary, and all tags listed in the config are added to the dataset's `tags`
- Support for thumbnails in the sample grid, additional media fields, and segmentation mask labels
- Includes FiftyOne version validation, ensuring that the dataset to be loaded is compatible with the user's current version of `fiftyone`
Loading from the Hub
The `load_from_hub()` utility in the `hf_hub` utils allows you to load datasets from the Hugging Face Hub that are in either:
- Parquet format — compatible with the Hugging Face `datasets` library, and accessible via the Datasets Server API
- Any FiftyOne dataset type, from those listed here
When you use `load_from_hub()`, you must provide the `repo_id`, which identifies the organization and repo on the Hugging Face Hub where the dataset can be found. This is the only positional argument.
The loading config also needs to be specified in one of three ways:
- Via a `fiftyone.yml` or `fiftyone.yaml` file in the Hugging Face repo itself
- Using the `config_file` keyword argument to specify the (local) location of the config file to use
- By passing the config params directly into the `load_from_hub()` call via keywords
The only required element is a format specifier. For Parquet datasets, you can use `format="ParquetFilesDataset"` or `format="parquet"` for short. For FiftyOne formats, use the name of the class. For instance, for a dataset in the format `fiftyone.types.dataset_types.COCODetectionDataset`, use `format="COCODetectionDataset"`.
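For example, here is a minimal sketch of the `config_file` option; the repo name and local config path below are hypothetical placeholders:
import fiftyone.utils.hf_hub as fouh

# hypothetical repo and local config path, for illustration only
dataset = fouh.load_from_hub(
    "username/my-dataset",
    config_file="/path/to/fiftyone.yml",
)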
Loading Arguments
Additionally, the user can specify the following options:
- `revision`: the revision (or version commit) of the dataset to load
- `split` or `splits`: which of the available splits to load
- `subset` or `subsets`: which of the available subsets to load. Many datasets on the hub have multiple subsets. As an example, check out `newyorker_caption_contest`, which has 30 subsets.
- `max_samples`: the maximum number of samples per <split, subset> pair to load. This can be useful if you want to rapidly get a feel for the dataset without downloading 100s of GBs of data.
- `batch_size`: the batch size to use when requesting data from the datasets server and adding samples to the FiftyOne dataset
- `num_workers`: thread pool workers to use when downloading media
- `overwrite`: whether to overwrite existing documents for the dataset
- `persistent`: whether to persist the loaded dataset to disk
- `name`: a name to use for the dataset. If included, this will override any name present in the config file.
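To illustrate how these compose, here is a minimal sketch combining several of the options above; the repo, subset, and split names are hypothetical placeholders:
import fiftyone.utils.hf_hub as fouh

# hypothetical repo, subset, and split names, for illustration only
dataset = fouh.load_from_hub(
    "username/my-dataset",
    format="parquet",
    subset="default",
    split="train",
    max_samples=500,
    batch_size=50,
    num_workers=4,
    persistent=True,
    name="my-dataset-sample",
)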
Example Usage
To illustrate the power, flexibility, and simplicity of this approach, here are a few examples with popular datasets on the Hugging Face Hub.
For all of these examples, we will use the following imports:
import fiftyone as fo
import fiftyone.utils.hf_hub as fouh
mnist
Load the test split of the MNIST dataset:
dataset = fouh.load_from_hub(
"mnist",
split="test",
format="parquet",
classification_fields="label"
)
session = fo.launch_app(dataset)
Here, "mnist" is the repo id, and we are using classification_fields="label" to specify that the feature called "label" in the Hugging Face dataset should be converted into a FiftyOne Classification label.
coyo-700m
Load the first 1,000 samples from the COYO-700M dataset:
dataset = fouh.load_from_hub(
"kakaobrain/coyo-700m",
format="parquet",
max_samples=1000
)
session = fo.launch_app(dataset)
Here we use `max_samples` to specify that we only want the first 1,000 samples.
cppe-5
Load the CPPE-5 dataset and persist it to the database:
dataset = fouh.load_from_hub(
"cppe-5",
format="parquet",
detection_fields="objects",
persistent=True
)
session = fo.launch_app(dataset)
Here we use detection_fields="objects" to specify that the feature "objects" should be converted into a FiftyOne Detections label field.
scene_parse150
Load just the test split from the scene_parsing subset:
dataset = fouh.load_from_hub(
"scene_parse150",
format="parquet",
subset="scene_parsing",
split="test",
classification_fields="scene_category",
mask_fields="annotation"
)
session = fo.launch_app(dataset)
Here we are using the "split" and "subset" keyword arguments to specify what we want to download. Also note that we are converting multiple features from the Hugging Face dataset into FiftyOne label fields. The segmentation masks are saved to disk.
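To verify the conversion, here is a quick sketch for inspecting the loaded labels, assuming the converted FiftyOne fields keep the Hugging Face feature names used above:
print(dataset)  # prints the dataset schema, including the converted label fields

# field names assumed to match the HF feature names from the call above
sample = dataset.first()
print(sample["scene_category"])  # expected to be a classification label
print(sample["annotation"])      # expected to be a segmentation label whose mask is saved to disk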
Documentation: For comprehensive coverage of all of these options, supported datasets, and more, see the PDF version of the integration docs, attached.
Pushing to the Hub
If you are working with a dataset in FiftyOne and you want to quickly share it with others, you can do so via the push_to_hub() function, which takes two positional arguments: the FiftyOne sample collection (a Dataset or DatasetView), and the repo_name, which will be combined with your username/org to construct the repo_id where the dataset will be uploaded.
When you push to the hub, a few things happen:
- The dataset and its media files are exported/uploaded in a specified format. By default, this format is `fiftyone.types.dataset_types.FiftyOneDataset`, but you can specify the format you want via the `dataset_type` keyword argument.
- A `fiftyone.yml` config file for the dataset is generated and uploaded, which contains all of the necessary information so that the dataset can be loaded with `load_from_hub()`.
- A Hugging Face Dataset Card for the dataset is auto-generated, providing tags, metadata, license info, and a code snippet illustrating how to load the dataset from the hub.
When you push to the hub, you can specify any/all of the following dataset card and config file attributes:
- `description`
- `license`
- `tags`
`push_to_hub()` supports the following Hugging Face API arguments:
- `private`: whether to upload the dataset as private or public
- `exist_ok`: whether to allow pushing to a repo that already exists (rather than raising an error)
Example Usage
- Upload the Quickstart dataset to a private repo:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
dataset = foz.load_zoo_dataset("quickstart")
fouh.push_to_hub(dataset, "quickstart")
- Upload this dataset as a public `COCODetectionDataset` with an MIT license:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
import fiftyone.types as fot
dataset = foz.load_zoo_dataset("quickstart")
fouh.push_to_hub(
dataset,
"quickstart-coco",
dataset_type=fot.dataset_types.COCODetectionDataset,
private=False,
license="mit",
label_fields="*" ### convert all label fields, not just ground truth
)
- Upload the first 3 samples of a video dataset:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
import fiftyone.types as fot
dataset = foz.load_zoo_dataset("quickstart-video")
fouh.push_to_hub(
dataset[:3],
"video-test",
private=True,
)
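- Upload the Quickstart dataset with a description and tags, illustrating the dataset card attributes above (a minimal sketch; the repo name, description, and tags are placeholders):
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh

dataset = foz.load_zoo_dataset("quickstart")

# placeholder repo name, description, and tags, for illustration only
fouh.push_to_hub(
    dataset,
    "quickstart-demo",
    description="A small demo dataset for testing the integration",
    license="mit",
    tags=["demo", "quickstart"],
    private=False,
    exist_ok=True,
)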
Wish List
- Support different detection formats (like VOC, COCO, etc.)
- Extend support for specifying the location of the config file at remote URLs, such as GitHub repos
- Extend this "config"-based approach to loading from GitHub, and to working with datasets, models, and plugins
- URL redirects for anchor links from the previous version of the integration, which just had transformers
- Add robustness to the download process — currently, requests sent to the Datasets Server sometimes hang, which is not great
- Add support for converting directly from the Hugging Face dataset, assuming it was downloaded with `datasets.load_dataset()`
- Add functionality to extend an existing dataset with the next batch of samples. In other words, if you used `load_from_hub()` with `max_samples=1000`, but now you want the remaining samples, it shouldn't need to query the server for the first 1000. Currently, it doesn't re-download media files, but it does re-query the server.
- Handle FiftyOne version compatibility: currently, `push_to_hub()` sets the required FiftyOne version to the user's version on upload, but this is too restrictive...
Release Notes
Is this a user-facing change that should be mentioned in the release notes?
- [ ] No. You can skip the rest of this section.
- [x] Yes. Give a description of this change to be included in the release notes for FiftyOne users.
(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)
What areas of FiftyOne does this PR affect?
- [ ] App: FiftyOne application changes
- [ ] Build: Build and test infrastructure changes
- [ ] Core: Core `fiftyone` Python library changes
- [x] Documentation: FiftyOne documentation changes
- [x] Other
Summary by CodeRabbit
- New Features
- Introduced utilities for integrating with Hugging Face, allowing users to push datasets to the Hugging Face Hub and load datasets from the Hub into FiftyOne. This includes support for dataset management, handling metadata, field conversions, and configuring datasets based on Hugging Face Hub specifications.
Walkthrough
The new fiftyone/utils/huggingface.py file introduces a comprehensive set of utilities for seamless integration with the Hugging Face ecosystem, enabling tasks like dataset management, configuration handling, and flexible dataset loading options.
Changes
| File | Change Summary |
|---|---|
| fiftyone/utils/huggingface.py | Introduces utilities for Hugging Face integration, including dataset pushing to the Hub, loading datasets from the Hub, managing repos, uploading datasets, metadata handling, field conversions, and configuration based on Hub dataset settings. Provides flexible dataset loading options. |
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.24%. Comparing base (`90c8853`) to head (`fd624fa`). Report is 314 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #4193 +/- ##
============================================
+ Coverage 16.00% 99.24% +83.24%
============================================
Files 734 35 -699
Lines 82223 15236 -66987
Branches 1119 0 -1119
============================================
+ Hits 13159 15121 +1962
+ Misses 69064 115 -68949
| Flag | Coverage Δ | |
|---|---|---|
| app | ? | |
| python | 99.24% <ø> (?) | |
Flags with carried forward coverage won't be shown.