
Bnb/dh refactor

Open bnb32 opened this issue 8 months ago • 0 comments

Ok, here we go...

sup3r/preprocessing was previously just data handlers and batch handlers, essentially. Now we have Loaders, Extracters, Derivers, and Cachers, which are composed in sup3r.preprocessing.data_handlers.factory to build objects similar to the old DataHandlers. These do basically everything the old handlers used to do, except for training / batching related routines like sampling and normalization. Loaders load netcdf / h5 data into an xr.Dataset-like container. Extracters extract spatiotemporal regions of data. Derivers derive new features from raw feature data. Cachers write data to either h5 or netcdf, depending on the extension of the output file provided.
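To make the composition idea concrete, here is a minimal stdlib-only sketch of the Loader -> Extracter -> Deriver -> Cacher chain. It is not the sup3r implementation: the class bodies, the `data_handler_factory` function, the dict-of-lists container, and the `.json` / `.csv` formats (standing in for h5 / netcdf) are all assumptions made for illustration.

```python
import csv
import json
import math
import os
import tempfile


class Loader:
    """Hypothetical stand-in: 'loads' raw features into a dict container."""

    def __init__(self, raw):
        self.data = {name: list(values) for name, values in raw.items()}


class Extracter:
    """Selects a time window from the loaded data."""

    def __init__(self, loader, time_slice):
        self.data = {k: v[time_slice] for k, v in loader.data.items()}


class Deriver:
    """Derives new features (here only wind speed) from raw features."""

    def __init__(self, extracter, features):
        data = dict(extracter.data)
        if "windspeed" in features:
            data["windspeed"] = [
                math.hypot(u, v) for u, v in zip(data["u"], data["v"])
            ]
        # Keep only the requested features, in the requested order.
        self.data = {name: data[name] for name in features}


class Cacher:
    """Writes data to disk, choosing the format from the file extension."""

    def __init__(self, container, path):
        ext = os.path.splitext(path)[1]
        if ext == ".json":
            with open(path, "w") as f:
                json.dump(container.data, f)
        elif ext == ".csv":
            names = list(container.data)
            with open(path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(names)
                writer.writerows(zip(*(container.data[n] for n in names)))
        else:
            raise ValueError(f"unsupported cache extension: {ext}")


def data_handler_factory(raw, time_slice, features, cache_path=None):
    """Compose the steps into one handler-like object with a .data attribute."""
    handler = Deriver(Extracter(Loader(raw), time_slice), features)
    if cache_path is not None:
        Cacher(handler, cache_path)
    return handler


raw = {"u": [3.0, 0.0, 6.0, 1.0], "v": [4.0, 5.0, 8.0, 1.0]}
cache_path = os.path.join(tempfile.mkdtemp(), "cache.json")
handler = data_handler_factory(raw, slice(0, 3), ["u", "windspeed"], cache_path)
print(handler.data)
```

Each step only needs the `.data` of the previous one, which is what lets the pieces be tested and reused on their own.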

In sup3r/preprocessing we additionally have Samplers and BatchQueues. These are composed in sup3r.preprocessing.batch_handlers.factory to build objects similar to the old BatchHandlers. These do basically everything that the old batch handlers used to do, with some exceptions. The most notable exception is that they don't split data into training and validation sets. Instead, BatchHandler objects take "collections" of data-handler-like objects (these can be DataHandlers, Extracters, Derivers, etc.) for both training and validation, and a separate batch queue is used for each. Samplers simply contain an xr.Dataset-like object and sample that data as an iterator. BatchQueue objects interface with samplers to keep a queue full of batches / samples while models are training.
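The sampler / queue split described above can be sketched with the standard library alone. This is an illustration, not the sup3r classes: the constructor arguments, the `None` sentinel, and the rule of picking one sampler at random per batch are assumptions, and a plain list stands in for the dataset.

```python
import queue
import random
import threading


class Sampler:
    """Iterator yielding random fixed-length windows from a sequence."""

    def __init__(self, data, sample_len, seed=0):
        self.data = data
        self.sample_len = sample_len
        self.rng = random.Random(seed)

    def __iter__(self):
        return self

    def __next__(self):
        start = self.rng.randrange(len(self.data) - self.sample_len + 1)
        return self.data[start:start + self.sample_len]


class BatchQueue:
    """Keeps a bounded queue of batches filled by a background thread."""

    def __init__(self, samplers, batch_size, n_batches, max_queued=4, seed=0):
        self.samplers = samplers
        self.batch_size = batch_size
        self.n_batches = n_batches
        self.max_queued = max_queued
        self.rng = random.Random(seed)

    def _fill(self, q):
        for _ in range(self.n_batches):
            # Assumption: each batch is drawn from one randomly chosen sampler.
            sampler = self.rng.choice(self.samplers)
            q.put([next(sampler) for _ in range(self.batch_size)])  # blocks if full
        q.put(None)  # sentinel marking the end of the epoch

    def __iter__(self):
        q = queue.Queue(maxsize=self.max_queued)
        threading.Thread(target=self._fill, args=(q,), daemon=True).start()
        while (batch := q.get()) is not None:
            yield batch


# Training and validation data are supplied separately, each with its own queue.
train_data = list(range(100))
val_data = list(range(100, 120))
train_queue = BatchQueue([Sampler(train_data, 5, seed=1)], batch_size=4, n_batches=3)
val_queue = BatchQueue([Sampler(val_data, 5, seed=2)], batch_size=4, n_batches=2)

train_batches = list(train_queue)
val_batches = list(val_queue)
print(len(train_batches), len(val_batches))
```

The background thread keeps producing batches while the consumer (a training loop, in practice) is busy, and the bounded queue caps how far ahead it can run.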

All these smaller objects (loaders, extracters, derivers, samplers) are built on top of xr.Dataset-like objects (sup3r.preprocessing.accessor.Sup3rX and sup3r.preprocessing.base.Sup3rDataset), which serve as the familiar .data attribute for data and batch handlers. Sup3rDataset is wrapped around Sup3rX to provide an interface for the "dual" dataset objects contained by dual handlers, and it acts exactly like Sup3rX when datasets are not dual. Sup3rX is an xr.Dataset "accessor" class, which is the recommended way to extend xr.Datasets (as opposed to subclassing). These Sup3rX / Sup3rDataset objects act similarly to xr.Datasets but with extended functionality. The tests in tests/data_wrappers/ show how to interact with these objects.
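The "acts like the single member unless dual" behaviour can be pictured as attribute delegation. The sketch below is only an analogy in plain Python: `Member` and `DatasetWrapper` are hypothetical stand-ins rather than Sup3rX / Sup3rDataset, the member names `low_res` and `high_res` are invented, and raising on ambiguous attributes in the dual case is an assumption, not documented sup3r behaviour.

```python
class Member:
    """Hypothetical stand-in for a single dataset with a small API."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def shape(self):
        return (len(self.values),)

    def mean(self):
        return sum(self.values) / len(self.values)


class DatasetWrapper:
    """Hypothetical wrapper holding one or two named members."""

    def __init__(self, **members):
        if not 1 <= len(members) <= 2:
            raise ValueError("expected one or two named members")
        self._members = members

    @property
    def is_dual(self):
        return len(self._members) == 2

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        members = self.__dict__.get("_members", {})
        if name in members:
            return members[name]  # explicit member access by name
        if len(members) == 1:
            # Not dual: behave exactly like the single wrapped member.
            return getattr(next(iter(members.values())), name)
        raise AttributeError(
            f"{type(self).__name__!r} has no unambiguous attribute {name!r}"
        )


single = DatasetWrapper(high_res=Member([1.0, 2.0, 3.0]))
dual = DatasetWrapper(low_res=Member([1.0, 2.0]), high_res=Member([1.0, 2.0, 3.0, 4.0]))

print(single.is_dual, single.shape, single.mean())
print(dual.is_dual, dual.low_res.shape, dual.high_res.shape)
```

Code written against the single-member interface keeps working unchanged, while dual-aware code reaches for a member by name.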

Since the fundamental dataset objects are now xr.Dataset-like, they can use dask arrays to store data. This means we don't need to load data into memory until we need the result of a computation. ForwardPassStrategy and ForwardPass have been updated accordingly: we can lazy-load the full input dataset and then index the data handler .data attribute to select generator input chunks, all before loading anything into memory. BatchHandler objects have a mode argument, which can be set to either lazy (load batches into memory only when they are sent out for training) or eager (load .data into memory upon handler initialization).
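The difference between the two modes can be shown with a toy container that counts how many elements are actually read. This is a stdlib-only sketch, not dask or sup3r code: `CountingReader`, `LazySeries`, and `Handler` are hypothetical names, and `compute()` here merely imitates the role that loading a dask-backed dataset plays.

```python
class CountingReader:
    """Pretend data source that counts how many elements get read."""

    def __init__(self, values):
        self.values = values
        self.length = len(values)
        self.reads = 0

    def __call__(self, index):
        self.reads += len(index)
        return [self.values[i] for i in index]


class LazySeries:
    """Slicing only narrows the index; nothing is read until compute()."""

    def __init__(self, reader, index=None):
        self.reader = reader
        self.index = range(reader.length) if index is None else index

    def __getitem__(self, key):
        # Slices only in this sketch: slicing a range yields another range.
        return LazySeries(self.reader, self.index[key])

    def compute(self):
        return self.reader(self.index)


class Handler:
    """Handler-like object with a lazy or eager .data attribute."""

    def __init__(self, reader, mode="lazy"):
        if mode not in ("lazy", "eager"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        lazy = LazySeries(reader)
        # Eager: read everything at initialization. Lazy: defer all reads.
        self.data = lazy.compute() if mode == "eager" else lazy

    def chunk(self, start, stop):
        part = self.data[start:stop]
        return part.compute() if self.mode == "lazy" else part


lazy_reader = CountingReader(list(range(1000)))
lazy = Handler(lazy_reader, mode="lazy")
reads_after_init = lazy_reader.reads
lazy_chunk = lazy.chunk(10, 14)

eager_reader = CountingReader(list(range(1000)))
eager = Handler(eager_reader, mode="eager")
eager_chunk = eager.chunk(10, 14)

print(reads_after_init, lazy_reader.reads, eager_reader.reads)
```

Both handlers return the same chunk; the lazy one pays only for the elements it hands out, while the eager one pays the full cost up front and then serves chunks from memory.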

Tests have been added for all new preprocessing modules and lots of documentation / notes have been added throughout. Tests should hopefully provide good examples of use patterns for these new objects.

bnb32, Jun 23 '24 02:06