[FR] Load/Create Dataset from xarray

Open asmith26 opened this issue 3 years ago • 10 comments

Proposal Summary

Xarray is a great library for working with labelled multi-dimensional arrays - this feels quite similar to what this (fantastic!) library does, so it would be great to be able to load data from and work with the xarray format.

Motivation

  • I like to store multi-dimensional arrays in zarr format, which would perhaps make moving around fiftyone datasets easier + make working with cloud storage (e.g. AWS S3) easier.
  • Xarray supports lazy loading of data for big datasets: https://xarray.pydata.org/en/stable/user-guide/dask.html
  • Xarray seems quite popular and is a part of the PyData stack.

I'm definitely not an expert on the fiftyone library, so please feel free to point me to docs regarding solutions to the above :)

What areas of FiftyOne does this feature affect?

  • [ ] App: FiftyOne application
  • [x] Core: Core fiftyone Python library
  • [ ] Server: FiftyOne server

Details

As an example, suppose I have the following xarray:

import numpy as np
import xarray as xr

data = xr.DataArray(
    data=np.array([[[[1, 1, 0], [2, 2, 0]], [[3, 3, 0], [4, 4, 1]]],
                      [[[10, 10, 0], [20, 20, 0]], [[30, 30, 0], [40, 40, 1]]]]),
    dims=["sample_idx", "height", "width", "band"],
    coords={"band": ["band1", "band2", "label"]},
    attrs={"res": 1}
)

In fiftyone, this may look something like:

import fiftyone as fo

sample = fo.Sample(filepath=None)
sample["X"] = data.sel(band=["band1", "band2"])
sample["Y"] = fo.Segmentation(mask=data.sel(band="label"))

dataset = fo.Dataset(name="training-data")
dataset.add_samples([sample])
dataset.default_mask_targets = {0: "Yes", 1: "No"} 
dataset["res"] = data.res  # not sure if you can store dataset metadata?
dataset.save()

I've had a little look through the fiftyone docs, and possibly this could be solved via https://voxel51.com/docs/fiftyone/user_guide/dataset_creation/datasets.html#custom-formats

Willingness to contribute

  • [x] Yes. I would be willing to contribute this feature with guidance from the FiftyOne community.

Many thanks for any help! :)

asmith26 avatar Feb 16 '22 19:02 asmith26

Hi @asmith26 👋

Just to orient us, here's a couple FiftyOne axioms:

  • FiftyOne is designed for representing collections of visual data (currently images and videos) together with any associated metadata, which can be task-specific (eg detections, segmentations, polylines), or primitive fields (str, float, int, bool) or arbitrary JSON.
  • FiftyOne always "owns" the storage of a dataset's metadata. When you load data into FiftyOne, it is always written to a backing MongoDB database, which from then on is the source from which that dataset's contents are served when you interact with the FiftyOne App or the Python API.
  • The actual media (images and videos) are not stored in MongoDB; only pointers to the images and videos on disk/cloud storage are stored. And, FiftyOne requires random access to these images and videos, eg when viewing a dataset or view into it in the FiftyOne App

That said, if you have an xarray that contains information about a set of images or videos that you want to load into FiftyOne (presumably because you prefer the visualization and query capabilities that FiftyOne provides), then it would absolutely make sense to write a custom importer that will load the xarray into FiftyOne dataset format! 💯 🥇

Since xarray offers very flexible, schema-less storage, the importer would need to either expect that the xarray data satisfies a certain data model, or you'd need a way to define the schema of your xarray in such a way that the data could be transformed into the appropriate FiftyOne Label/field types.

brimoor avatar Feb 16 '22 19:02 brimoor

Thanks very much for this information @brimoor

...(presumably because you prefer the visualization and query capabilities that FiftyOne provides)...

This is what I am after thanks. Regarding how to use a custom importer, I think I understand what this may look like:

import fiftyone as fo

xr_importer = XarrayDatasetImporter(...)
dataset = fo.Dataset.from_importer(xr_importer)

The thing I'm a bit unsure about is how to write XarrayDatasetImporter. I'm hoping to use this for an image segmentation problem, so I think I need to inherit from foud.LabeledImageDatasetImporter.

I'll give this a go, and thanks again for this fantastic lib! :)

asmith26 avatar Feb 16 '22 20:02 asmith26

The DatasetImporter interface is just a fancy way to encapsulate the code required to transform a certain external/disk representation of some data in a pre-defined format into FiftyOne Sample instances.

The simplest approach is to just start by writing a function that takes your xarray as input and constructs a FiftyOne dataset via a simple Python loop.

Refactoring that function into a DatasetImporter would let you use methods like Dataset.from_importer(), Dataset.add_importer(), and Dataset.merge_importer(), which give you the usual variations (creating a new dataset, adding to an existing dataset, or merging new fields onto an existing dataset) "for free".

The DatasetImporter subclasses like UnlabeledImageDatasetImporter and LabeledImageDatasetImporter just define slightly different interfaces depending on whether you are importing images only or images with one-or-more label fields, respectively.

There is also a GenericSampleDatasetImporter class for situations where you find it easier to just let your importer construct and return entire Sample instances containing arbitrary contents. Since xarray can contain arbitrary data, the most ambitious implementation would be a GenericSampleDatasetImporter that could generate lots of different types of samples depending on what the input array's schema is. GeoJSONDatasetImporter is an example of such an importer.

tl;dr: I'd recommend just writing a simple utility function first :)
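
As a rough sketch of what such a utility function could look like for the example array at the top of this issue: the names sample_filepaths and xarray_to_dataset, the "ground_truth" field name, and the choice to render only "band1" as a grayscale PNG via Pillow are all assumptions made for illustration, not part of FiftyOne. It also assumes the band values already fit in the 0-255 range.

```python
import os


def sample_filepaths(out_dir, num_samples, ext=".png"):
    """Return one zero-padded image path per sample index (hypothetical helper)."""
    return [os.path.join(out_dir, f"{i:06d}{ext}") for i in range(num_samples)]


def xarray_to_dataset(data, out_dir, name=None):
    """Build a FiftyOne dataset from a (sample_idx, height, width, band) DataArray.

    Assumes `data` has a "band" coordinate containing "band1" and "label",
    and a "res" entry in `data.attrs`, as in the example above.
    """
    # Third-party imports live here so the path helper above stays usable
    # without FiftyOne (and its MongoDB backend) being available
    import numpy as np
    from PIL import Image
    import fiftyone as fo

    os.makedirs(out_dir, exist_ok=True)
    paths = sample_filepaths(out_dir, data.sizes["sample_idx"])

    samples = []
    for i, path in enumerate(paths):
        item = data.isel(sample_idx=i)

        # FiftyOne needs the media on disk, so write one band out as an image
        image = item.sel(band="band1").values.astype(np.uint8)
        Image.fromarray(image).save(path)

        # The label band becomes a per-pixel segmentation mask
        mask = item.sel(band="label").values.astype(np.uint8)
        sample = fo.Sample(filepath=path)
        sample["ground_truth"] = fo.Segmentation(mask=mask)
        samples.append(sample)

    dataset = fo.Dataset(name)
    dataset.add_samples(samples)
    dataset.default_mask_targets = {0: "Yes", 1: "No"}
    dataset.info["res"] = data.attrs["res"]  # dataset-level metadata
    dataset.save()
    return dataset
```

With the DataArray from the original post, this could be called as xarray_to_dataset(data, "/tmp/xarray-images", name="training-data"), where the output directory is just a placeholder. Multi-band inputs would need their own rule for mapping bands to image channels.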

brimoor avatar Feb 16 '22 20:02 brimoor

Thanks very much for all your help @brimoor. I've been trying to implement the simple Python loop method, but I'm struggling to initialize a sample without providing a filepath:

sample = fo.Sample(filepath=filepath)

Just wondering if it is possible to do something like:

import numpy as np
import dask.array as da

sample = fo.Sample(image_data=np.array([...]))
sample = fo.Sample(image_data=da.Array([...]))  # lazy loading

Many thanks again!

asmith26 avatar Feb 17 '22 10:02 asmith26

This is what I was getting at with this point:

The actual media (images and videos) are not stored in MongoDB; only pointers to the images and videos on disk/cloud storage are stored. And, FiftyOne requires random access to these images and videos, eg when viewing a dataset or view into it in the FiftyOne App

In order to use FiftyOne, you'll have to write the images as png/jpg/etc files on disk somewhere and pass those paths to Sample.filepath. FiftyOne isn't an in-memory data format; media is stored on disk and metadata is stored in MongoDB.

brimoor avatar Feb 17 '22 14:02 brimoor

@brimoor Could I get a little more detail on why this is not possible, i.e. why can't there be a fetching function that returns an image array upon request? It doesn't have to be stored in memory all the time.

What makes the common image formats special, that the functionality can't be extended over an arbitrary object?

mg515 avatar Jul 20 '22 07:07 mg515

Hi @mg515, an image fetching API could probably be added. Some work would need to be done to standardize the way that media is accessed across the entire library though, because there are lots of ways media must be consumed across the library:

brimoor avatar Jul 20 '22 13:07 brimoor

@mg515 Are there any updates on the possibility of adding an image fetching API? We are dealing with a similar case, and we would love to use FiftyOne.

janerikvw avatar Aug 31 '23 07:08 janerikvw

@brimoor, are there any updates on an image fetching API implementation? It would greatly improve the usability of FiftyOne for various dataset formats.

SergeyMilyaev avatar Feb 08 '24 12:02 SergeyMilyaev

@brimoor, @danielgural I noticed a similar request in the Slack channel regarding loading from npz. While a custom parser makes it possible to ingest the npz, it would be great not to have to store the images locally in jpg/png format when they are already available in another format.

hemangchawla avatar Mar 13 '24 08:03 hemangchawla