earthkit-data
Refactor array fieldlists
This PR introduces the following changes:
Array fieldlists
(array fieldlist = fieldlist containing in-memory fields each with a values array and a metadata object)
- Adds `SimpleFieldList`, a fieldlist containing a list of arbitrary fields. Part of it was already implemented under the name `FieldArray`. `FieldArray` is kept as an alias because `anemoi-datasets` uses it. `SimpleFieldList` is a mutable object because fields can be appended to it with `append()`. It is a top-level object and can be imported directly from earthkit-data. It can be used like this:

      from earthkit.data import SimpleFieldList

      # build the fieldlist incrementally
      ds = SimpleFieldList()
      for ...
          f = ....  # make a field
          ds.append(f)

      # or build it from an existing list of fields
      my_fields = [...]
      ds = SimpleFieldList(my_fields)

- @staticmethod `FieldList.from_fields()` creates a `SimpleFieldList`.
- Removes `ArrayFieldList`. Its functionality is implemented by `SimpleFieldList`.
- `ArrayField` becomes a top-level object and can be imported directly from earthkit-data.
- @staticmethod `FieldList.from_array()` returns a `SimpleFieldList` containing `ArrayField`s. Previously it returned an `ArrayFieldList`.

      ds = FieldList.from_array(array_list, metadata_list)

- The "list-of-dicts" source returns a `SimpleFieldList` containing `ArrayField`s.
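
The relationship between the objects described in this section can be illustrated with a toy model. This is a minimal sketch in plain Python, not the earthkit-data implementation: `ToyField` and `ToyFieldList` are hypothetical stand-ins for `ArrayField` and `SimpleFieldList`, and metadata is assumed to be a plain dict.

```python
from dataclasses import dataclass, field


@dataclass
class ToyField:
    """Hypothetical stand-in for ArrayField: a values array plus metadata."""

    values: list
    metadata: dict


@dataclass
class ToyFieldList:
    """Hypothetical stand-in for SimpleFieldList: a mutable list of fields."""

    fields: list = field(default_factory=list)

    def append(self, f):
        # mutability: the fieldlist grows in place
        self.fields.append(f)

    @staticmethod
    def from_array(arrays, metadata):
        # pair each values array with its metadata object
        if len(arrays) != len(metadata):
            raise ValueError("need one metadata object per array")
        return ToyFieldList([ToyField(a, m) for a, m in zip(arrays, metadata)])

    def __len__(self):
        return len(self.fields)

    def __getitem__(self, i):
        return self.fields[i]


# build from arrays and metadata, then append one more field
ds = ToyFieldList.from_array(
    [[1.0, 2.0], [3.0, 4.0]],
    [{"shortName": "t"}, {"shortName": "u"}],
)
ds.append(ToyField([5.0, 6.0], {"shortName": "v"}))
```

The point of the sketch is that a single container type can serve both construction paths: building from arrays in one call, and appending fields one by one.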
Array backends
- The `array-api-compat` package becomes a mandatory dependency. It provides the array namespaces.
- Simplifies the array backend implementation; Field/FieldList no longer contains any array backend related objects.
- Removes the `array_backend` option from the FieldList constructor. This means we can no longer load a GRIB fieldlist from a file/stream that behaves as if its data were stored in a given array backend format. Most probably this feature was not used at all.
- We can still create an array fieldlist with a given array backend by calling the `to_fieldlist()` method on a FieldList or by using `from_array()`:

      from earthkit.data import FieldList, from_source

      # create an array fieldlist with numpy arrays
      ds = from_source("file", "my.grib").to_fieldlist()
      ds = from_source("file", "my.grib").to_fieldlist(array_backend="numpy")

      # create an array fieldlist with torch tensors
      ds = from_source("file", "my.grib").to_fieldlist(array_backend="pytorch")

      # create an array fieldlist with torch tensors from an existing tensor
      array = ...  # torch tensor
      md = ...  # list of metadata objects
      FieldList.from_array(array, md)

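
One way to keep Field/FieldList free of array backend objects is to infer the backend from the values array whenever it is needed. The sketch below illustrates that idea in plain Python; it is not how earthkit-data or `array-api-compat` does it. `backend_name()` is a hypothetical helper, and the standard-library `array.array` stands in for a real array type such as a torch tensor.

```python
from array import array


def backend_name(values):
    """Hypothetical helper: name the backend from the array's own type.

    The top-level module of the type identifies the library that owns it,
    so nothing backend related has to be stored on the field itself.
    """
    return type(values).__module__.split(".")[0]


stdlib_values = array("d", [1.0, 2.0, 3.0])
list_values = [1.0, 2.0, 3.0]

print(backend_name(stdlib_values))
print(backend_name(list_values))
```

Applied to a numpy ndarray the same helper would report `numpy`, and for a torch tensor `torch`, since those are the top-level modules of the respective types.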
Questions/Major changes
- Can `array-api-compat` be a mandatory dependency? Or should we rely on it only when arrays other than numpy arrays are used?
- mutable `SimpleFieldList` (`append()` method)
- exposing `SimpleFieldList` (`from earthkit.data import SimpleFieldList`)
- exposing `ArrayField` (`from earthkit.data import ArrayField`)
- removal of `array_backend` from FieldList/from_source()
- What should be the preferred way to build an array fieldlist during computations? The current recommendation is this:

      # create an empty fieldlist
      ds_r = FieldList()
      for f in fs:
          p = f.metadata("level") * 100.0  # hPa -> Pa
          t_new = potential_temperature(f.values, p)
          md_new = f.metadata().override(shortName="pt")
          # create new numpy fieldlist with a single field
          ds_new = FieldList.from_array(t_new, md_new)
          # add it to the resulting fieldlist
          ds_r += ds_new

With `SimpleFieldList` and `ArrayField` we can rewrite it as:

      # create an empty fieldlist
      ds_r = SimpleFieldList()
      for f in fs:
          p = f.metadata("level") * 100.0  # hPa -> Pa
          t_new = potential_temperature(f.values, p)
          md_new = f.metadata().override(shortName="pt")
          # create new numpy field and add it to the resulting fieldlist
          ds_r.append(ArrayField(t_new, md_new))

Or (this still needs to be implemented) the previous code could even use a `FieldList` instead of a `SimpleFieldList` as the starting point:

      # create an empty fieldlist
      ds_r = FieldList()

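
The loops in this section call `potential_temperature()` without defining it. For readers who want to run them, a minimal stand-in is sketched below, assuming the standard dry-air definition theta = T * (p0 / p)^kappa with p0 = 100000 Pa and kappa of about 0.2857. The helper actually used by the author is not shown in this PR and may differ. The stand-in works on plain floats and should also work on any array type that supports elementwise arithmetic.

```python
P0 = 100000.0  # reference pressure in Pa (assumed)
KAPPA = 0.2857  # approximate R_d / c_p for dry air (assumed)


def potential_temperature(t, p):
    """Stand-in: potential temperature (K) from temperature (K) and pressure (Pa)."""
    return t * (P0 / p) ** KAPPA


level_hpa = 500
p = level_hpa * 100.0  # hPa -> Pa, the same conversion as in the loops
theta = potential_temperature(250.0, p)
```

At the reference pressure the function returns the temperature unchanged, which is a quick sanity check on the unit conversion: passing hPa instead of Pa would give a very different value.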
Further questions about ArrayField
Let us suppose `f` is an `ArrayField` containing a torch tensor and GRIB metadata. The return types of the following methods are straightforward:

      f.to_numpy() -> ndarray
      f.to_array() -> torch tensor
      f.values -> torch tensor

However, it is not clear whether the following methods should return a torch tensor or an ndarray:

      f.to_latlon()
      f.to_points()
      f.data()
      f.grid_points()
      f.grid_points_unrotated()