fiftyone
[FR] Load/Create Dataset from xarray
Proposal Summary
Xarray is a great library for working with labelled multi-dimensional arrays - this feels quite similar to what this (fantastic!) library does, so it would be great to be able to load data from and work with the xarray format.
Motivation
- I like to store multi-dimensional arrays in zarr format, which would perhaps make moving around fiftyone datasets easier + make working with cloud storage (e.g. AWS S3) easier.
- Xarray supports lazy loading of data for big datasets: https://xarray.pydata.org/en/stable/user-guide/dask.html
- Xarray seems quite popular and is a part of the PyData stack.
I'm definitely not an expert on the fiftyone library, so please feel free to point me to docs regarding solutions to the above :)
What areas of FiftyOne does this feature affect?
- [ ] App: FiftyOne application
- [x] Core: Core `fiftyone` Python library
- [ ] Server: FiftyOne server
Details
As an example, suppose I have the following xarray:
```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    data=np.array(
        [
            [[[1, 1, 0], [2, 2, 0]], [[3, 3, 0], [4, 4, 1]]],
            [[[10, 10, 0], [20, 20, 0]], [[30, 30, 0], [40, 40, 1]]],
        ]
    ),
    dims=["sample_idx", "height", "width", "band"],
    coords={"band": ["band1", "band2", "label"]},
    attrs={"res": 1},
)
```
In fiftyone, this may look something like:
```python
import fiftyone as fo

sample = fo.Sample(filepath=None)
sample["X"] = data.sel(band=["band1", "band2"])
sample["Y"] = fo.Segmentation(mask=data.sel(band="label"))

dataset = fo.Dataset(name="training-data")
dataset.add_samples([sample])
dataset.default_mask_targets = {0: "Yes", 1: "No"}
dataset["res"] = data.res  # not sure if you can store dataset metadata?
dataset.save()
```
I've had a little look through the fiftyone docs, and possibly this could be solved via https://voxel51.com/docs/fiftyone/user_guide/dataset_creation/datasets.html#custom-formats
Willingness to contribute
- [x] Yes. I would be willing to contribute this feature with guidance from the FiftyOne community.
Many thanks for any help! :)
Hi @asmith26 👋
Just to orient us, here are a couple of FiftyOne axioms:
- FiftyOne is designed for representing collections of visual data (currently images and videos) together with any associated metadata, which can be task-specific (e.g. detections, segmentations, polylines), primitive fields (str, float, int, bool), or arbitrary JSON.
- FiftyOne always "owns" the storage of a dataset's metadata. When you load data into FiftyOne, it is always written to a backing MongoDB database, which is henceforward the source from which that dataset's contents are served when you interact with the FiftyOne App or the Python API.
- The actual media (images and videos) are not stored in MongoDB; only pointers to the images and videos on disk/cloud storage are stored. And FiftyOne requires random access to these images and videos, e.g. when viewing a dataset or a view into it in the FiftyOne App.
That said, if you have an `xarray` that contains information about a set of images or videos that you want to load into FiftyOne (presumably because you prefer the visualization and query capabilities that FiftyOne provides), then it would absolutely make sense to write a custom importer that loads the xarray into FiftyOne dataset format! 💯 🥇
Since `xarray` offers very flexible, schema-less storage, the importer would either need to expect that the xarray data satisfies a certain data model, or you'd need a way to define the schema of your xarray such that the data could be transformed into the appropriate FiftyOne `Label`/field types.
Thanks very much for this information @brimoor.

> ...(presumably because you prefer the visualization and query capabilities that FiftyOne provides)...

This is what I'm after, thanks. Regarding how to use a custom importer, I think I understand what this may look like:
```python
import fiftyone as fo

xr_importer = XarrayDatasetImporter(...)
dataset = fo.Dataset.from_importer(xr_importer)
```
The thing I'm a bit unsure on is how to write `XarrayDatasetImporter`. I'm hoping to use this for an image segmentation problem, so I think I need to inherit from `foud.LabeledImageDatasetImporter`.
I'll give this a go, and thanks again for this fantastic lib! :)
The `DatasetImporter` interface is just a fancy way to encapsulate the code required to transform a certain external/disk representation of some data in a pre-defined format into FiftyOne `Sample` instances.
The simplest approach is to just start by writing a function that takes your `xarray` as input and constructs a FiftyOne dataset via a simple Python loop.
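For the example array earlier in the thread, such a loop could look roughly like the following. This is only a sketch: the band names, the `images_dir` file layout, the `ground_truth` field name, and the use of `imageio` as the image writer are all assumptions, and the FiftyOne-specific imports are deferred into the function so the pure array helper can be read (and tested) on its own.

```python
import os

import numpy as np


def split_bands(sample, band_names, feature_bands, label_band):
    """Split one (height, width, band) array into feature channels and a label mask."""
    feats = sample[..., [band_names.index(b) for b in feature_bands]]
    mask = sample[..., band_names.index(label_band)]
    return feats, mask


def xarray_to_fiftyone(data, images_dir, name="training-data"):
    """Sketch: build a FiftyOne dataset from a (sample_idx, height, width, band) DataArray.

    FiftyOne needs real image files on disk, so each sample's feature bands
    are written to ``images_dir`` first.
    """
    # Deferred imports so the helper above stays usable without these deps
    import fiftyone as fo
    import imageio.v3 as iio  # any image writer would do

    band_names = list(data.coords["band"].values)
    samples = []
    for i in range(data.sizes["sample_idx"]):
        feats, mask = split_bands(
            data.isel(sample_idx=i).values, band_names, ["band1", "band2"], "label"
        )
        # Real data would likely be 1- or 3-channel; 2-band data is just the example
        filepath = os.path.join(images_dir, f"{i}.png")
        iio.imwrite(filepath, feats.astype(np.uint8))

        sample = fo.Sample(filepath=filepath)
        sample["ground_truth"] = fo.Segmentation(mask=mask.astype(np.uint8))
        samples.append(sample)

    dataset = fo.Dataset(name)
    dataset.add_samples(samples)
    return dataset
```

The key structural point is the one from the axioms above: the loop writes pixels to disk and hands FiftyOne only the file paths plus the label objects.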
Refactoring that function into a `DatasetImporter` would allow you to use methods like `Dataset.from_importer()`, `Dataset.add_importer()`, and `Dataset.merge_importer()` that provide slight tweaks such as creating a new dataset vs adding to an existing dataset vs merging new fields onto an existing dataset "for free".
The `DatasetImporter` subclasses like `UnlabeledImageDatasetImporter` and `LabeledImageDatasetImporter` just define slightly different interfaces depending on whether you are importing images only or images with one or more label fields, respectively.
There is also a `GenericSampleDatasetImporter` class for situations where you find it easier to just let your importer construct and return entire `Sample` instances containing arbitrary contents. Since `xarray` can contain arbitrary data, the most ambitious implementation would be a `GenericSampleDatasetImporter` that could generate lots of different types of samples depending on the input array's schema. `GeoJSONDatasetImporter` is an example of such an importer.
tl;dr: I'd recommend just writing a simple utility function first :)
Thanks very much for all your help @brimoor. I've been trying to implement the simple Python loop method, but I'm struggling to initialize a sample without providing a filepath:
```python
sample = fo.Sample(filepath=filepath)
```
Just wondering if it is possible to do something like:
```python
import numpy as np
import dask.array as da

sample = fo.Sample(image_data=np.array([...]))                 # hypothetical API
sample = fo.Sample(image_data=da.from_array(np.array([...])))  # lazy loading
```
Many thanks again!
This is what I was getting at with this point:
> The actual media (images and videos) are not stored in MongoDB; only pointers to the images and videos on disk/cloud storage are stored. And, FiftyOne requires random access to these images and videos, eg when viewing a dataset or view into it in the FiftyOne App

In order to use FiftyOne, you'll have to write the images as png/jpg/etc files on disk somewhere and pass those paths to `Sample.filepath`. FiftyOne isn't an in-memory data format; media is stored on disk and metadata is stored in MongoDB.
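Concretely, that means a small materialization step before constructing samples. A sketch, where the temporary path and the use of Pillow as the image writer are just one choice:

```python
import os
import tempfile

import numpy as np
from PIL import Image  # any image writer works here

# Hypothetical in-memory pixels, e.g. one slice of an xarray/dask array
pixels = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)

# FiftyOne needs a real file on disk, so write the array out first
filepath = os.path.join(tempfile.gettempdir(), "sample0.png")
Image.fromarray(pixels).save(filepath)

# Now the *path* (not the pixels) backs a FiftyOne sample:
# import fiftyone as fo
# sample = fo.Sample(filepath=filepath)
```

Since PNG is lossless, the original array can be recovered exactly from the file whenever it's needed again.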
@brimoor Could I get a little more detail on why this is not possible, i.e. why can't there be a fetching function that would get you an image array upon request? It wouldn't have to be stored in memory all the time.
What makes the common image formats special, such that the functionality can't be extended to an arbitrary object?
Hi @mg515, an image fetching API could probably be added. Some work would need to be done to standardize the way that media is accessed across the entire library, though, because media must be consumed in many places:
- The App would need to use the fetch API whenever it requests media for either its grid view or expanded modal
- Methods like `apply_model()` and `compute_embeddings()` would need it when feeding images to data loaders
- Annotation integrations that upload media to other services would need it as well
@mg515 Are there any updates on the possibility of adding an image fetching API? We are dealing with a similar case and would love to use FiftyOne.
@brimoor, are there any updates on an image fetching API implementation? It would greatly improve the usability of FiftyOne for various dataset formats.
@brimoor, @danielgural I note a similar request in the Slack channel regarding loading from `npz` files. While a custom parser allows ingesting the `npz`, it would be great not to have to store the images locally in jpg/png format when they are already available in another format.