Refactor array fieldlists

Open sandorkertesz opened this issue 5 months ago • 1 comments

This PR introduces the following changes:

Array fieldlists

(array fieldlist = fieldlist containing in-memory fields each with a values array and a metadata object)

  • Adds SimpleFieldList, a fieldlist that holds a list of arbitrary fields. Part of it already existed under the name FieldArray, which is kept as an alias because anemoi-datasets uses it. SimpleFieldList is mutable: fields can be appended to it with append(). It is a top-level object and can be imported directly from earthkit-data. It can be used like this:
from earthkit.data import SimpleFieldList

ds = SimpleFieldList()
for ...
    f = .... # make a field
    ds.append(f)

my_fields = [...]
ds = SimpleFieldList(my_fields)
  • The static method FieldList.from_fields() creates a SimpleFieldList.
  • Removes ArrayFieldList. Its functionality is now implemented by SimpleFieldList.
  • ArrayField becomes a top-level object and can be imported directly from earthkit-data.
  • The static method FieldList.from_array() returns a SimpleFieldList containing ArrayFields. Previously it returned an ArrayFieldList.
ds = FieldList.from_array(array_list, metadata_list)

  • The "list-of-dicts" source returns a SimpleFieldList containing ArrayFields.
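
To make the container semantics above concrete, here is a minimal sketch in plain Python. ToyField and ToyFieldList are hypothetical stand-ins invented for illustration; they are not the earthkit-data classes and only mimic the behaviour described in this list (construction from a list of fields, append(), and a from_array()-style constructor that pairs arrays with metadata).

```python
class ToyField:
    """Hypothetical stand-in for an in-memory field: values plus metadata."""

    def __init__(self, values, metadata):
        self.values = values
        self.metadata = metadata


class ToyFieldList:
    """Hypothetical stand-in for a mutable fieldlist holding arbitrary fields."""

    def __init__(self, fields=None):
        # Copy the input so later appends do not modify the caller's list.
        self._fields = list(fields) if fields is not None else []

    def append(self, field):
        # Mutates the fieldlist in place.
        self._fields.append(field)

    @staticmethod
    def from_array(arrays, metadata):
        # Pair each array with its metadata object, one field per pair.
        if len(arrays) != len(metadata):
            raise ValueError("arrays and metadata must have the same length")
        return ToyFieldList(ToyField(a, m) for a, m in zip(arrays, metadata))

    def __len__(self):
        return len(self._fields)

    def __getitem__(self, index):
        return self._fields[index]

    def __iter__(self):
        return iter(self._fields)


# Build incrementally with append()
ds = ToyFieldList()
ds.append(ToyField([1.0, 2.0], {"shortName": "t"}))

# Build in one go from arrays and metadata
ds2 = ToyFieldList.from_array(
    [[1.0], [2.0]],
    [{"shortName": "t"}, {"shortName": "u"}],
)
```

The sketch copies the input list in the constructor so that appending to the fieldlist never changes the list passed in by the caller; whether SimpleFieldList makes the same choice is not stated in this PR.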

Array backends

  • The array-api-compat package becomes a mandatory dependency. It provides the array namespaces.
  • Simplifies the array backend implementation. Field and FieldList no longer contain any array-backend-related objects.
  • Removes the array_backend option from the FieldList constructor. This means a GRIB fieldlist loaded from a file or stream can no longer behave as if its data were stored in a given array backend format. Most probably this feature was not used at all.
  • An array fieldlist with a given array backend can still be created by calling to_fieldlist() on a FieldList or by using from_array():
# create an array fieldlist with numpy arrays
ds = from_source("file", "my.grib").to_fieldlist()
ds = from_source("file", "my.grib").to_fieldlist(array_backend="numpy")

# create an array fieldlist with torch tensors
ds = from_source("file", "my.grib").to_fieldlist(array_backend="pytorch")

# create an array fieldlist from an existing torch tensor
array = ... # torch tensor
md = ... # list of metadata objects
ds = FieldList.from_array(array, md)
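
The design point here is that the backend is chosen when the data is materialised in memory, not when the file is opened. A minimal sketch of that idea follows, using only the standard library. Everything in it is hypothetical: ToyLazyField, the CONVERTERS table, and this free-standing to_fieldlist() function are illustrative stand-ins, and stdlib containers play the role that numpy or pytorch would play in earthkit-data.

```python
from array import array

# Hypothetical converters keyed by backend name. In earthkit-data the
# backends would be numpy, pytorch and so on; stdlib containers stand in here.
CONVERTERS = {
    "list": list,
    "array": lambda values: array("d", values),
}


class ToyLazyField:
    """Hypothetical stand-in for a file-backed field decoded on demand."""

    def __init__(self, raw, metadata):
        self._raw = raw
        self.metadata = metadata

    def decode(self):
        # Pretend to decode the stored message into plain floats.
        return [float(v) for v in self._raw]


def to_fieldlist(fields, array_backend="list"):
    """Materialise every field in memory using the requested backend."""
    try:
        convert = CONVERTERS[array_backend]
    except KeyError:
        raise ValueError(f"unknown array backend: {array_backend!r}") from None
    # Each in-memory field is a (values, metadata) pair.
    return [(convert(f.decode()), f.metadata) for f in fields]


lazy = [ToyLazyField(("1", "2"), {"shortName": "t"})]
as_lists = to_fieldlist(lazy)                           # default backend
as_arrays = to_fieldlist(lazy, array_backend="array")   # explicit backend
```

The lazy fields themselves carry no backend information, which mirrors the stated goal that Field and FieldList should not contain array-backend-related objects.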

Questions/Major changes

  • Can array-api-compat be a mandatory dependency, or should we rely on it only when arrays other than numpy arrays are used?
  • mutable SimpleFieldList (append() method)
  • exposing SimpleFieldList (from earthkit.data import SimpleFieldList)
  • exposing ArrayField (from earthkit.data import ArrayField)
  • removal of array_backend from FieldList/from_source()
  • What should be the preferred way to build an array fieldlist during computations? The current recommendation is this:
# create an empty fieldlist
ds_r = FieldList()

for f in fs:
    p = f.metadata("level")*100. # hPa -> Pa
    t_new = potential_temperature(f.values, p)
    md_new = f.metadata().override(shortName="pt")
    
    # create new numpy fieldlist with a single field
    ds_new = FieldList.from_array(t_new, md_new)

    # add it to the resulting fieldlist
    ds_r += ds_new

With SimpleFieldList and ArrayField we can rewrite it as:

# create an empty fieldlist
ds_r = SimpleFieldList()

for f in fs:
    p = f.metadata("level")*100. # hPa -> Pa
    t_new = potential_temperature(f.values, p)
    md_new = f.metadata().override(shortName="pt")
    
    # create new numpy field and add it to the resulting fieldlist
    ds_r.append(ArrayField(t_new, md_new))

Alternatively (this still needs to be implemented), the previous code could even start from a FieldList instead of a SimpleFieldList:

# create an empty fieldlist
ds_r = FieldList()
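
The append-based loop can be tried without earthkit-data installed. The sketch below rests on stated assumptions: plain dicts and a Python list replace the fields and the fieldlist, and potential_temperature is a hypothetical pure-Python version of Poisson's equation, theta = T * (p0 / p) ** kappa, with p0 = 1000 hPa and kappa approximated as 0.2857 for dry air. It is not the function used in the snippets above.

```python
P0 = 100000.0   # reference pressure in Pa (1000 hPa)
KAPPA = 0.2857  # approximate R_d / c_p for dry air (assumption)


def potential_temperature(t, p):
    """Poisson's equation applied element-wise to a list of temperatures in K."""
    return [ti * (P0 / p) ** KAPPA for ti in t]


# Toy input standing in for GRIB fields: metadata dict plus values in K.
fs = [
    {"metadata": {"shortName": "t", "level": 1000}, "values": [288.0, 290.0]},
    {"metadata": {"shortName": "t", "level": 500}, "values": [250.0, 252.0]},
]

# A plain list plays the role of the mutable fieldlist.
ds_r = []

for f in fs:
    p = f["metadata"]["level"] * 100.0  # hPa -> Pa
    t_new = potential_temperature(f["values"], p)
    # Copy the metadata and override the parameter name.
    md_new = {**f["metadata"], "shortName": "pt"}
    # Create the new field and append it to the result in place.
    ds_r.append({"metadata": md_new, "values": t_new})
```

At 1000 hPa the pressure equals the reference pressure, so the potential temperature equals the input temperature; at 500 hPa it is higher than the input. Appending in place avoids building a single-field fieldlist on every iteration, which is the main difference from the `+=` pattern.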

Further questions about ArrayField

Let us suppose f is an ArrayField containing a torch tensor and GRIB metadata. The return types of the following methods are straightforward:

f.to_numpy() -> ndarray
f.to_array() -> torch tensor
f.values -> torch tensor

However, it is not clear whether the following methods should return a torch tensor or an ndarray:

f.to_latlon()
f.to_points()
f.data()
f.grid_points()
f.grid_points_unrotated()

sandorkertesz, Sep 22 '24 20:09