earthkit-data Refactor array fieldlists

Refactor array fieldlists

Open sandorkertesz opened this issue 5 months ago • 1 comments

This PR introduces the following changes:

Array fieldlists

(array fieldlist = fieldlist containing in-memory fields each with a values array and a metadata object)

Adds SimpleFieldList, a fieldlist containing a list of arbitrary fields. Part of it was already implemented under the name FieldArray; FieldArray is kept as an alias because anemoi-datasets uses it. SimpleFieldList is mutable: fields can be appended to it with append(). It is a top-level object and can be imported directly from earthkit.data. It can be used like this:

from earthkit.data import SimpleFieldList

ds = SimpleFieldList()
for ...:
    f = ...  # make a field
    ds.append(f)

my_fields = [...]
ds = SimpleFieldList(my_fields)

The static method FieldList.from_fields() creates a SimpleFieldList.
Removes ArrayFieldList; its functionality is now implemented by SimpleFieldList.
ArrayField becomes a top-level object and can be imported directly from earthkit.data.
The static method FieldList.from_array() returns a SimpleFieldList containing ArrayFields. Previously it returned an ArrayFieldList.

ds = FieldList.from_array(array_list, metadata_list)

The "list-of-dicts" source returns a SimpleFieldList containing ArrayFields.
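The relationship between the pieces described in this section can be pictured with a small, self-contained model. The sketch below is not earthkit-data code: ToyField, ToyFieldList, ToyFieldArray and toy_from_array are hypothetical stand-ins that only mimic the behaviour stated in this description (a field is a values array plus a metadata object, the list is mutable via append(), an old name is kept as an alias, and from_array pairs arrays with metadata). Plain Python lists and dicts stand in for real arrays and metadata objects.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyField:
    """Hypothetical stand-in for ArrayField: a values array plus metadata."""
    values: Any
    metadata: dict


@dataclass
class ToyFieldList:
    """Hypothetical stand-in for SimpleFieldList: a mutable list of fields."""
    fields: list = field(default_factory=list)

    def append(self, f):
        # The list is mutated in place; no new fieldlist is created.
        self.fields.append(f)

    def __len__(self):
        return len(self.fields)

    def __getitem__(self, i):
        return self.fields[i]


# Old name kept as an alias, in the spirit of FieldArray.
ToyFieldArray = ToyFieldList


def toy_from_array(arrays, metadata):
    """Pair each array with its metadata object and wrap them in a list."""
    if len(arrays) != len(metadata):
        raise ValueError("need exactly one metadata object per array")
    return ToyFieldList([ToyField(a, m) for a, m in zip(arrays, metadata)])


# Build a fieldlist incrementally ...
ds = ToyFieldList()
ds.append(ToyField([280.0, 281.5], {"shortName": "t", "level": 850}))

# ... or from arrays and metadata in one go.
ds2 = toy_from_array(
    [[280.0, 281.5], [270.0, 271.0]],
    [{"shortName": "t", "level": 850}, {"shortName": "t", "level": 500}],
)
```

Keeping the container a thin wrapper around a list of independent fields is what makes append() cheap in this model; whether the real classes are structured this way is not stated here.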

Array backends

The array-api-compat package becomes a mandatory dependency. It provides the array namespaces.
Simplifies the array backend implementation. Field and FieldList no longer contain any array-backend-related objects.
Removes the array_backend option from the FieldList constructor. This means we can no longer load a GRIB fieldlist from a file or stream and have it behave as if its data were stored in a given array backend format. This feature was most probably not used at all.
We can still create an array fieldlist with a given array backend by calling to_fieldlist() on a FieldList or by using from_array():

# create an array fieldlist with numpy arrays
ds = from_source("file", "my.grib").to_fieldlist()
ds = from_source("file", "my.grib").to_fieldlist(array_backend="numpy")

# create an array fieldlist with torch tensors
ds = from_source("file", "my.grib").to_fieldlist(array_backend="pytorch")

# create an array fieldlist from an existing torch tensor
array = ... # torch tensor
md = ... # list of metadata objects
FieldList.from_array(array, md)
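To illustrate what "providing the array namespaces" can look like in practice, here is a minimal sketch of a namespace lookup. The helpers backend_of and namespace_for are hypothetical and are not part of earthkit-data; the only third-party call used is array_api_compat.array_namespace. The sketch imports array-api-compat lazily, which is one possible pattern if the package were to stay optional for numpy-only use; it is an assumption for illustration, not a description of what this PR does.

```python
def backend_of(values):
    """Top-level package defining the type of `values`, e.g. "numpy" or "torch"."""
    return type(values).__module__.split(".")[0]


def namespace_for(values):
    """Return an array namespace for `values` (hypothetical helper)."""
    if backend_of(values) == "numpy":
        # numpy arrays: use numpy directly, no extra dependency needed.
        import numpy

        return numpy
    # Any other array type: import array-api-compat only at this point.
    from array_api_compat import array_namespace

    return array_namespace(values)
```

With this layout, code that only ever sees numpy arrays never triggers the array-api-compat import, while a torch tensor would be routed through array_namespace(). Making the package mandatory removes the need for the branch altogether.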

Questions/Major changes

Can array-api-compat be a mandatory dependency, or should we rely on it only when arrays other than numpy arrays are used?
The mutable SimpleFieldList (append() method).
Exposing SimpleFieldList (from earthkit.data import SimpleFieldList).
Exposing ArrayField (from earthkit.data import ArrayField).
Removal of array_backend from FieldList/from_source().
What should be the preferred way to build an array fieldlist during computations? The current recommendation is this:

# create an empty fieldlist
ds_r = FieldList()

for f in fs:
    p = f.metadata("level") * 100.0  # hPa -> Pa
    t_new = potential_temperature(f.values, p)
    md_new = f.metadata().override(shortName="pt")
    
    # create new numpy fieldlist with a single field
    ds_new = FieldList.from_array(t_new, md_new)

    # add it to the resulting fieldlist
    ds_r += ds_new

With SimpleFieldList and ArrayField we can rewrite it as:

# create an empty fieldlist
ds_r = SimpleFieldList()

for f in fs:
    p = f.metadata("level") * 100.0  # hPa -> Pa
    t_new = potential_temperature(f.values, p)
    md_new = f.metadata().override(shortName="pt")
    
    # create new numpy field and add it to the resulting fieldlist
    ds_r.append(ArrayField(t_new, md_new))

Alternatively (this still needs to be implemented), the previous code could even start from a FieldList instead of a SimpleFieldList:

# create an empty fieldlist
ds_r = FieldList()
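The loops above call potential_temperature() without defining it. For readers who want to run them, here is a minimal stand-in, assuming the standard dry-air formula theta = T * (p0 / p) ** kappa with pressure in Pa. The constants P0 and KAPPA are assumed values chosen for illustration, and this is not necessarily the function the author had in mind.

```python
P0 = 100000.0      # reference pressure in Pa (assumed)
KAPPA = 2.0 / 7.0  # approximate R_d / c_p for dry air (assumed)


def potential_temperature(t, p):
    """Potential temperature in K from temperature t (K) and pressure p (Pa)."""
    # Only arithmetic operators are used, so floats and array types that
    # overload them should work equally well.
    return t * (P0 / p) ** KAPPA


# At the reference pressure the temperature is returned unchanged.
theta_ref = potential_temperature(300.0, 100000.0)

# At lower pressure the potential temperature exceeds the temperature.
theta_500 = potential_temperature(300.0, 50000.0)
```

Because the function relies only on operators, the same body would apply to whatever array type f.values returns, which fits the backend-agnostic direction of this PR.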

Further questions about ArrayField

Let us suppose f is an ArrayField containing a torch tensor and GRIB metadata. The return types of the following methods are straightforward:

f.to_numpy() -> ndarray
f.to_array() -> torch tensor
f.values -> torch tensor

However, it is not clear whether the following methods should return a torch tensor or an ndarray:

f.to_latlon()
f.to_points()
f.data()
f.grid_points()
f.grid_points_unrotated()

Sep 22 '24 20:09 sandorkertesz
