Revamped Hugging Face Hub Integration
First draft of the new Hugging Face Hub integration.
Attachment: Hugging Face Integration — FiftyOne 0.24.0 documentation.pdf
Overview
This integration introduces two functions: load_from_hub() and push_to_hub().
The major architectural changes from the previous version are as follows:
- Use configs: a `fiftyone.yml`, `fiftyone.yaml`, or custom local YAML config file (this can be extended to GitHub as well) specifies the format of the dataset to be converted into FiftyOne. This is in contrast to the previous version of the integration, which had a default converter that tried its best but was not guaranteed to succeed, and custom Python loading script execution, which we can still add back in at some point, but is not necessarily the lowest barrier to entry.
- Rather than using HF's `datasets` library, loading the dataset with their `load_dataset()` function, and then converting — which limited our flexibility, required massive downloads in many cases, and resulted in duplication of a ton of files — this version uses the HF Datasets Server API to directly request and load the relevant data without needing to go through the `datasets` Python library. This gives a few key additional advantages, as documented below.
- Config info can be specified exclusively via kwargs, removing the need to explicitly create a `fiftyone.yml` file locally.
- Supports loading datasets from gated repos
Additional improvements:
- Added logging during the loading & conversion process
- `description`, `license`, and `repo_id` from the HF config file are added to the FiftyOne dataset's `info` dictionary, and all tags listed in the config are added to the dataset's `tags`
- Support for thumbnails in the sample grid, additional media fields, and segmentation mask labels
- Includes FiftyOne version validation, ensuring that the dataset to be loaded is compatible with the user's current version of `fiftyone`
Loading from the Hub
The `load_from_hub()` utility in the `hf_hub` utils allows you to load datasets from the Hugging Face Hub that are in either:
- Parquet format — compatible with the Hugging Face `datasets` library, and accessible via the Datasets Server API
- Any FiftyOne dataset type, from those listed here
When you use `load_from_hub()`, you must provide the `repo_id`, which identifies the organization and repo on the Hugging Face Hub where the dataset can be found. This is the only positional argument.
The loading config also needs to be specified in one of three ways:
- Via a `fiftyone.yml` or `fiftyone.yaml` file in the Hugging Face repo itself
- Using the `config_file` keyword argument to specify the (local) location of the config file to use
- By passing the config params directly into the `load_from_hub()` call via keywords
The only required element is a format specifier. For Parquet datasets, you can use `format="ParquetFilesDataset"` or `format="parquet"` for short. For FiftyOne formats, use the name of the class. For instance, for a dataset in the format `fiftyone.types.dataset_types.COCODetectionDataset`, use `format="COCODetectionDataset"`.
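For example, here is a minimal sketch of the `config_file` option; the repo name and local config path below are hypothetical placeholders:
import fiftyone.utils.hf_hub as fouh

# hypothetical repo and local config path, for illustration only
dataset = fouh.load_from_hub(
    "username/my-dataset",
    config_file="/path/to/fiftyone.yml",
)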
Loading Arguments
Additionally, the user can specify the following options:
- `revision`: the revision (or version commit) of the dataset to load
- `split` or `splits`: which of the available splits to load
- `subset` or `subsets`: which of the available subsets to load. Many datasets on the hub have multiple subsets. As an example, check out `newyorker_caption_contest`, which has 30 subsets.
- `max_samples`: the maximum number of samples per <split, subset> pair to load. This can be useful if you want to rapidly get a feel for the dataset without downloading 100s of GBs of data.
- `batch_size`: the batch size to use when requesting data from the datasets server and adding samples to the FiftyOne dataset
- `num_workers`: thread pool workers to use when downloading media
- `overwrite`: whether to overwrite existing documents for the dataset
- `persistent`: whether to persist the loaded dataset to disk
- `name`: a name to use for the dataset. If included, this will override any name present in the config file.
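To illustrate how these compose, here is a minimal sketch combining several of the options above; the repo, subset, and split names are hypothetical placeholders:
import fiftyone.utils.hf_hub as fouh

# hypothetical repo, subset, and split names, for illustration only
dataset = fouh.load_from_hub(
    "username/my-dataset",
    format="parquet",
    subset="default",
    split="train",
    max_samples=500,
    batch_size=50,
    num_workers=4,
    persistent=True,
    name="my-dataset-sample",
)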
Example Usage
To illustrate the power, flexibility, and simplicity of this approach, here are a few examples with popular datasets on the Hugging Face Hub.
For all of these examples, we will use the following imports:
import fiftyone as fo
import fiftyone.utils.hf_hub as fouh
mnist
Load the test split of the MNIST dataset:
dataset = fouh.load_from_hub(
"mnist",
split="test",
format="parquet",
classification_fields="label"
)
session = fo.launch_app(dataset)
Here, "mnist" is the repo id, and we are using classification_fields="label" to specify that the feature called "label" in the Hugging Face dataset should be converted into a FiftyOne Classification label.
coyo-700m
Load the first 1,000 samples from the COYO-700M dataset:
dataset = fouh.load_from_hub(
"kakaobrain/coyo-700m",
format="parquet",
max_samples=1000
)
session = fo.launch_app(dataset)
Here we use `max_samples` to specify that we only want the first 1,000 samples.
cppe-5
Load the CPPE-5 dataset and persist it to the database:
dataset = fouh.load_from_hub(
"cppe-5",
format="parquet",
detection_fields="objects",
persistent=True
)
session = fo.launch_app(dataset)
Here we use detection_fields="objects" to specify that the feature "objects" should be converted into a FiftyOne Detections label field.
scene_parse150
Load just the test split from the scene_parsing subset:
dataset = fouh.load_from_hub(
"scene_parse150",
format="parquet",
subset="scene_parsing",
split="test",
classification_fields="scene_category",
mask_fields="annotation"
)
session = fo.launch_app(dataset)
Here we are using the "split" and "subset" keyword arguments to specify what we want to download. Also note that we are converting multiple features from the Hugging Face dataset into FiftyOne label fields. The segmentation masks are saved to disk.
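To verify the conversion, here is a quick sketch for inspecting the loaded labels, assuming the converted FiftyOne fields keep the Hugging Face feature names used above:
print(dataset)  # prints the dataset schema, including the converted label fields

# field names assumed to match the HF feature names from the call above
sample = dataset.first()
print(sample["scene_category"])  # expected to be a classification label
print(sample["annotation"])      # expected to be a segmentation label whose mask is saved to disk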
Documentation: For comprehensive coverage of all of these options, supported datasets, and more, see the PDF version of the integration docs, attached.
Pushing to the Hub
If you are working with a dataset in FiftyOne and you want to quickly share it with others, you can do so via the push_to_hub() function, which takes two positional arguments: the FiftyOne sample collection (a Dataset or DatasetView), and the repo_name, which will be combined with your username/org to construct the repo_id where the dataset will be uploaded.
When you push to the hub, a few things happen:
- The dataset and its media files are exported/uploaded in a specified format. By default, this format is `fiftyone.types.dataset_types.FiftyOneDataset`, but you can specify the format you want via the `dataset_type` keyword argument.
- A `fiftyone.yml` config file for the dataset is generated and uploaded, which contains all of the necessary information so that the dataset can be loaded with `load_from_hub()`.
- A Hugging Face Dataset Card for the dataset is auto-generated, providing tags, metadata, license info, and a code snippet illustrating how to load the dataset from the hub.
When you push to the hub, you can specify any/all of the following dataset card and config file attributes:
- `description`
- `license`
- `tags`
`push_to_hub()` supports the following Hugging Face API arguments:
- `private`: whether to upload the dataset as private or public
- `exist_ok`: whether to allow pushing to a repo that already exists (rather than raising an error)
Example Usage
- Upload the Quickstart dataset to a private repo:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
dataset = foz.load_zoo_dataset("quickstart")
fouh.push_to_hub(dataset, "quickstart")
- Upload this dataset as a public `COCODetectionDataset` with an MIT license:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
import fiftyone.types as fot
dataset = foz.load_zoo_dataset("quickstart")
fouh.push_to_hub(
dataset,
"quickstart-coco",
dataset_type=fot.dataset_types.COCODetectionDataset,
private=False,
license="mit",
label_fields="*" ### convert all label fields, not just ground truth
)
- Upload the first 3 samples of a video dataset:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
import fiftyone.types as fot
dataset = foz.load_zoo_dataset("quickstart-video")
fouh.push_to_hub(
dataset[:3],
"video-test",
private=True,
)
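- Upload the Quickstart dataset with a description and tags, illustrating the dataset card attributes above (a minimal sketch; the repo name, description, and tags are placeholders):
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh

dataset = foz.load_zoo_dataset("quickstart")

# placeholder repo name, description, and tags, for illustration only
fouh.push_to_hub(
    dataset,
    "quickstart-demo",
    description="A small demo dataset for testing the integration",
    license="mit",
    tags=["demo", "quickstart"],
    private=False,
    exist_ok=True,
)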
Wish List
- Support different detection formats (like VOC, COCO, etc.)
- Extend support for specifying the location of the config file at remote URLs, such as GitHub repos
- Extend this "config"-based approach to loading from GitHub, and to working with datasets, models, and plugins
- URL redirects for anchor links from the previous version of the integration, which just had transformers
- Add robustness to the download process — currently, requests sent to the Datasets Server sometimes hang, which is not great
- Add support for converting directly from the Hugging Face dataset, assuming it was downloaded with `datasets.load_dataset()`
- Add functionality to extend an existing dataset with the next batch of samples. In other words, if you used `load_from_hub()` with `max_samples=1000`, but now you want the remaining samples, it shouldn't need to query the server for the first 1000. Currently, it doesn't re-download media files, but it does re-query the server.
- Handle FiftyOne version compatibility: currently, `push_to_hub()` sets the required FiftyOne version to the user's version on upload, but this is too restrictive...
Release Notes
Is this a user-facing change that should be mentioned in the release notes?
- [ ] No. You can skip the rest of this section.
- [x] Yes. Give a description of this change to be included in the release notes for FiftyOne users.
(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)
What areas of FiftyOne does this PR affect?
- [ ] App: FiftyOne application changes
- [ ] Build: Build and test infrastructure changes
- [ ] Core: Core `fiftyone` Python library changes
- [x] Documentation: FiftyOne documentation changes
- [x] Other
Summary by CodeRabbit
- New Features
- Introduced utilities for integrating with Hugging Face, allowing users to push datasets to the Hugging Face Hub and load datasets from the Hub into FiftyOne. This includes support for dataset management, handling metadata, field conversions, and configuring datasets based on Hugging Face Hub specifications.
Walkthrough
The new fiftyone/utils/huggingface.py file introduces a comprehensive set of utilities for seamless integration with the Hugging Face ecosystem, enabling tasks like dataset management, configuration handling, and flexible dataset loading options.
Changes
| File | Change Summary |
|---|---|
| fiftyone/utils/huggingface.py | Introduces utilities for Hugging Face integration, including dataset pushing to the Hub, loading datasets from the Hub, managing repos, uploading datasets, metadata handling, field conversions, and configuration based on Hub dataset settings. Provides flexible dataset loading options. |
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.24%. Comparing base (`90c8853`) to head (`fd624fa`). Report is 314 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #4193 +/- ##
============================================
+ Coverage 16.00% 99.24% +83.24%
============================================
Files 734 35 -699
Lines 82223 15236 -66987
Branches 1119 0 -1119
============================================
+ Hits 13159 15121 +1962
+ Misses 69064 115 -68949
| Flag | Coverage Δ | |
|---|---|---|
| app | ? | |
| python | 99.24% <ø> (?) | |
Flags with carried forward coverage won't be shown.