
Refactor Scene creation for more flexibility

djhoese opened this issue on Dec 03 '20 • 7 comments

The Problem

The Scene class in Satpy currently has a lot of responsibilities and complexity wrapped into a single class with very few interfaces. While this makes Satpy very easy to use in the simple case, it makes the complex cases difficult, ugly, or impossible. It also makes Satpy and the Scene class difficult to maintain (adding new features, finding bugs, testing, etc.).

One of the bigger issues is the complexity of reading and loading data from files. This currently involves passing one or more sequences of filenames and one or more reader names to the Scene. A user then needs to call scn.load with the variables/fields/products/composites that they want to read from the provided files. Both the Scene creation and the load call accept arbitrary keyword arguments that get passed down to the lower-level Reader objects and the even lower-level FileHandler instances. Overall this is a lot of complexity wrapped up in two methods.

Example cases that are difficult with the current design:

  1. Providing different creation keyword arguments to readers (also file handler creation) when multiple readers are used at the same time. The only way to currently support this is to create multiple Scene objects with each reader and then join them together. This is relatively easy if the two readers are loading separate datasets, but if you want to load a composite that uses data from both readers it is very hard.
  2. Providing different loading keyword arguments to different file handlers or readers when calling Scene.load. Currently a single **kwargs works its way down to every file handler's methods for getting the AreaDefinition and the data as a DataArray object.
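To make the second difficulty concrete, here is a minimal sketch (with hypothetical class names, not Satpy's actual implementation) of the kwargs plumbing described above: one **kwargs dict is forwarded from Scene.load down through the reader to every file handler, so there is no way to target an option at a single reader or handler.

```python
# Sketch (hypothetical names) of how a single **kwargs dict is forwarded
# from Scene.load down to every file handler, which is why per-reader or
# per-handler options are hard to express with the current design.

class FileHandler:
    def get_dataset(self, name, **kwargs):
        # Every handler receives the same kwargs, whether it understands them or not.
        return {"name": name, "handler_kwargs": kwargs}

class Reader:
    def __init__(self, **reader_kwargs):
        self.reader_kwargs = reader_kwargs
        self.handlers = [FileHandler(), FileHandler()]

    def load(self, names, **kwargs):
        # The same kwargs reach every handler.
        return [h.get_dataset(n, **kwargs) for h in self.handlers for n in names]

class Scene:
    def __init__(self, reader_kwargs=None):
        # One reader_kwargs dict is applied to all readers being created.
        self.reader = Reader(**(reader_kwargs or {}))

    def load(self, names, **kwargs):
        # One kwargs dict reaches all file handlers of all readers.
        return self.reader.load(names, **kwargs)

scn = Scene(reader_kwargs={"reader_kwarg": True})
loaded = scn.load(["C01"], some_file_handler_kwarg=False)
```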

Most of the people assigned to this issue discussed this today over a video call. This issue summarizes what was talked about and allows for further discussion.

The General Proposal

Some of the functionality of the current Scene class should be extracted to get closer to a single responsibility, or at least fewer responsibilities. This issue focuses on the specific problem of reading and loading in the Scene, which will likely be the first refactor and should result in the Scene acting more like a data container rather than an all-in-one/all-powerful object. Eventually this could mean the Scene being an instance of xarray.Dataset, or a subclass of it if necessary.

The main change discussed during our meeting revolved around the idea of not providing the reader/filenames at Scene creation, but instead getting the data to the Scene in some other way. Failing that, allow the readers/data providers to be passed to the Scene instead of the Scene creating them itself. This would give users more control and customization over when and how things are created, and would make things easier to maintain by having well-defined objects/interfaces that are no longer part of the larger Scene object.

The Current Situation

Here are some examples that demonstrate the different ways the Scene can read/load data. Not all of these are realistic for the readers/variables shown, but you get the point.

Case 1 - Single Reader/Simple Load

Main Point: This simple case only takes 4 lines of code to go from input files to resampled output files.

scn = Scene(reader='abi_l1b', filenames=[...])
scn.load(['C01'])
new_scn = scn.resample('my_area')
new_scn.save_datasets(...)

Case 2 - Single Reader/Complex Load

Main Point: Multiple load calls are needed when loading requirements differ for each variable.

scn = Scene(reader='abi_l1b', filenames=[...])
scn.load(['C01'])
scn.load(['C02'], resolution=500)
scn.load(['C03'], some_file_handler_kwarg=False)
scn.load(['C04'], some_file_handler_kwarg=True)
scn.load([DataQuery(name='C03', calibration='radiance')])
new_scn = scn.resample('my_area')
new_scn.save_datasets(...)

Case 3 - Multiple Reader/Simple Composite Load

Main Point: The Scene can use multiple readers at the same time to easily load a composite that uses data from multiple readers.

scn = Scene(filenames={'abi_l1b': [...], 'glm_l2': [...]})
scn.load(['C14_flash_extent_density'])
new_scn = scn.resample('my_area')
new_scn.save_datasets(...)

Case 4 - Multiple Reader/Simple Composite Load/Custom Reader Keyword Args

Main Point: If the readers being created need different creation keyword arguments, then it takes much more work to achieve something like Case 3 above.

scn = Scene(reader='abi_l1b', filenames=[...], reader_kwargs={'reader_kwarg': True})
scn_glm = Scene(reader='glm_l2', filenames=[...], reader_kwargs={'reader_kwarg': False})

scn_glm.load(['flash_extent_density'])
scn['flash_extent_density'] = scn_glm['flash_extent_density']

scn.load(['C14_flash_extent_density'])
new_scn = scn.resample('my_area')
new_scn.save_datasets(...)

This could also be two separate kwargs: one understood by the ABI reader but not by the GLM reader.

Example Partial Solution 1 - Allow passing Reader instances

Take case 4 above:

abi_reader = load_reader('abi_l1b', reader_kwarg=True)
glm_reader = load_reader('glm_l2', reader_kwarg=False)
scn = Scene(filenames={abi_reader: [...], glm_reader: [...]})

scn.load(['C14_flash_extent_density'])
new_scn = scn.resample('my_area')
new_scn.save_datasets(...)

Requires the reader instances to be hashable so they can be dict keys.

Pros: Users have full control over how the readers are treated. Cons: It doesn't change Case 2 (loading kwargs). Doesn't really take the responsibility out of the Scene.
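The hashability requirement could be met by hashing on the pieces that identify a reader. A minimal sketch (HashableReader and this load_reader signature are hypothetical, not Satpy's actual classes):

```python
# Sketch of a hashable reader so instances can serve as dict keys, as
# "Partial Solution 1" requires. All names here are hypothetical.

class HashableReader:
    def __init__(self, name, **kwargs):
        self.name = name
        self.kwargs = kwargs  # assumed to hold only hashable values

    def __hash__(self):
        # Hash on identity-defining, immutable pieces only.
        return hash((self.name, frozenset(self.kwargs.items())))

    def __eq__(self, other):
        if not isinstance(other, HashableReader):
            return NotImplemented
        return (self.name, self.kwargs) == (other.name, other.kwargs)

def load_reader(name, **kwargs):
    # Hypothetical stand-in for the load_reader used in the example above.
    return HashableReader(name, **kwargs)

abi_reader = load_reader("abi_l1b", reader_kwarg=True)
glm_reader = load_reader("glm_l2", reader_kwarg=False)
# Reader instances can now key the filenames dict.
filenames = {abi_reader: ["abi_file.nc"], glm_reader: ["glm_file.nc"]}
```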

Example Partial Solution 2 - Data Provider

data_provider = LocalDataProvider(reader='abi_l1b', uris=filenames)
scn = data_provider.load(['C01'])
new_scn = scn.resample('my_area')

Pros: Clear separation of responsibilities for data loading versus the Scene as a container. No extra lines of code compared to Case 1. Cons: Doesn't really help with case 4 above. Full support would require the ability to merge providers and likely provide a concatenate/append/join functionality to the Scene data container. This Scene merging would require complex logic of merging dependency trees...not fun.

Note: LocalDataProvider is just an example. It could be a wrapper function that determines what lower-level provider to use or some kind of factory function that produces a provider that could be further customized if needed. :man_shrugging:
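The wrapper/factory idea mentioned in the note could look roughly like this sketch, where a function dispatches on the source to pick a provider class (all names and the scheme-based dispatch are hypothetical illustrations):

```python
# Sketch of the "wrapper/factory" idea: a function that inspects the
# source URIs and returns an appropriate provider. Hypothetical names.

class LocalDataProvider:
    def __init__(self, reader, uris):
        self.reader, self.uris = reader, uris

class RemoteDataProvider:
    def __init__(self, reader, uris):
        self.reader, self.uris = reader, uris

def data_provider(reader, uris):
    # Dispatch on the URI scheme; a real factory could also accept an
    # fsspec filesystem or credentials for further customization.
    if any(str(u).startswith(("s3://", "http://", "https://")) for u in uris):
        return RemoteDataProvider(reader=reader, uris=uris)
    return LocalDataProvider(reader=reader, uris=uris)

provider = data_provider("abi_l1b", ["s3://bucket/file.nc"])
```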

Summary

None of the partial solutions or other topics discussed provided a full solution or seemed "perfect"; that's what this issue is for. Overall, though, the data provider concept seemed well received.

Additional Discussion Points

  1. Case 2 above may not be realistic in the case where some_file_handler_kwarg differs between the data being loaded. We couldn't really think of cases where you'd want data loaded with one configuration alongside data loaded with another. Usually these types of options are pretty data-quality centric.
  2. Scene versus a new ResampledScene versus something else. However, there is no difference between a ResampledScene and a Scene if the data loaded from the reader are already on the same AreaDefinition.
  3. Factory functions/classes provided to the Scene to generate file handlers or readers as the Scene needs them. After further discussion, this didn't show much benefit over the partial solutions above.

Related Issues

  1. #1397

Additional/Related Topics

  1. Loading data from non-local storage (remote S3 bucket, authenticated storage, URL API, etc).
  2. Specifying keyword arguments to writers (__init__ versus save_dataset/s)
  3. Specifying keyword arguments to various steps of adding overlays and decorations before saving images.

djhoese commented Dec 03 '20

One important thing to focus on in this discussion is (as Martin pointed out): What do we want the user code to look like?

djhoese commented Dec 03 '20

Thanks for the detailed summary!

mraspaud commented Dec 04 '20

Thanks a lot for your efforts everyone, and @djhoese for the nice summary!

adybbroe commented Dec 04 '20

Thanks for the write-up! I think a single scene should be able to have multiple readers and multiple data providers. For flexibility and ease of reading, the user shouldn't be forced to provide all this information when creating the Scene, but should also be able to add it later.

How about:

cache_fs = CachingFileSystem(...)
local_fs = LocalFileSystem(...)
abi_reader = load_reader('abi_l1b', reader_kwarg=True)
glm_reader = load_reader('glm_l2', reader_kwarg=False)
abi_provider = FileSystemDataProvider(reader=abi_reader, src=filenames, fs=cache_fs)
glm_provider = FileSystemDataProvider(reader=glm_reader, src=filenames, fs=local_fs)
sc = Scene()
sc.add_provider(abi_provider)
sc.add_provider(glm_provider)
sc.load(["C02"], resolution=500)
sc.load([6.2, 7.3], calibration="brightness_temperature")
sc.load(["c14_flash_extent_density"])
ls = sc.resample("iowa")

That doesn't make the scene as lean as some may prefer, but it does move some responsibilities out of the scene class, and also out of Scene.__init__, which is currently a bit of a beast. I fear that making the scene too lean may make the perfect the enemy of the good and violate »practicality beats purity«.

gerritholl commented Dec 04 '20

I assume that the first 6 lines (fs, reader, and provider creation) could be combined in the simple cases, like we usually see with the current Scene?

What if the data providers and Scenes could be easily combined? For example, xarray's update functionality: http://xarray.pydata.org/en/stable/combining.html#update

abi_provider = ...
glm_provider = ...
joined_provider = abi_provider + glm_provider
sc = abi_provider.load(["C02"], resolution=500)
sc.update(glm_provider.load([6.2, 7.3], calibration="brightness_temperature"))
sc.update(joined_provider.load(["c14_flash_extent_density"]))
sc.update(xr_dataset_i_got_from_outside_satpy)

Nothing about the above would stop us from making Scene an instance of xarray.Dataset, except for maybe a custom update method that could handle renaming coordinates for differing resolutions (we can't have two x coordinates with different values, one for 500m and one for 1000m).
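The coordinate conflict mentioned above could be sketched as follows: a custom update would have to give each resolution its own dimension names before merging, since two variables cannot share an "x" dimension of different sizes. This is a hedged, dict-based stand-in for DataArrays, not a proposed implementation:

```python
# Sketch of the dimension-renaming a custom Scene.update would need:
# variables at different resolutions cannot share the dimension name "x",
# so suffix dims with the resolution. Dicts stand in for DataArrays here.

def resolution_suffixed_dims(variables):
    """Return a copy with ('y', 'x') renamed to resolution-specific names."""
    renamed = {}
    for name, var in variables.items():
        res = var["resolution"]
        dims = tuple(f"{d}_{res}" for d in var["dims"])
        renamed[name] = {**var, "dims": dims}
    return renamed

variables = {
    "C02": {"dims": ("y", "x"), "resolution": 500},
    "C14": {"dims": ("y", "x"), "resolution": 2000},
}
renamed = resolution_suffixed_dims(variables)
# "C02" now uses ("y_500", "x_500"); "C14" uses ("y_2000", "x_2000"),
# so both could coexist in one container without a dimension clash.
```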

My suggestion is just another option. Your suggestion does make me realize that this add_provider and load are usually (almost always?) a single occurrence, right? Not that resampling isn't also a one time thing usually, but why do we need to add a provider if we are only going to use it once?

djhoese commented Dec 04 '20

I assume that the first 6 lines (fs, reader, and provider creation) could be combined in the simple cases, like we usually see with the current Scene?

Yes.

What if the data providers and Scenes could be easily combined? For example, xarray's update functionality: http://xarray.pydata.org/en/stable/combining.html#update

abi_provider = ...
glm_provider = ...
joined_provider = abi_provider + glm_provider
sc = abi_provider.load(["C02"], resolution=500)
sc.update(glm_provider.load([6.2, 7.3], calibration="brightness_temperature"))
sc.update(joined_provider.load(["c14_flash_extent_density"]))
sc.update(xr_dataset_i_got_from_outside_satpy)

Nothing about the above would stop us from making Scene an instance of xarray.Dataset, except for maybe a custom update method that could handle renaming coordinates for differing resolutions (we can't have two x coordinates with different values, one for 500m and one for 1000m).

I haven't really thought deeply enough about whether a scene should be a dataset, and I think it's not quite the same question as we're discussing here. I'm not entirely convinced, in particular regarding dimension names between different channels. And we can currently do some things with scenes (such as slicing) that xarray datasets don't do. I feel the definition of an xarray dataset may be a bit too restrictive for a scene.

My suggestion is just another option. Your suggestion does make me realize that this add_provider and load are usually (almost always?) a single occurrence, right? Not that resampling isn't also a one time thing usually, but why do we need to add a provider if we are only going to use it once?

Good point. Maybe there should also be an API where the user can pass the filenames and the channels to be loaded in one go?

scene = scenemaker(filenames, reader="abi_l1b", datasets=["c02", "true_color"])

or even

resampled_scene = scenemaker(filenames, reader="abi_l1b", datasets=["c02", "true_color"], area="iowa")

probably covers most use cases.

Such a utility function could actually exist independently of any large changes in the underlying API. It could exist today.
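A sketch of what that utility might look like on top of the current Scene/load/resample API (scenemaker is the hypothetical name from the example above; a stub Scene stands in for satpy.Scene so the sketch is self-contained):

```python
# Sketch of the proposed scenemaker utility built on the current API.
# The stub Scene below stands in for satpy.Scene; in real use, scenemaker
# would simply wrap the real class.

class Scene:  # stand-in for satpy.Scene
    def __init__(self, filenames=None, reader=None, **kwargs):
        self.filenames, self.reader = filenames, reader
        self.loaded, self.area = [], None

    def load(self, datasets, **kwargs):
        self.loaded.extend(datasets)

    def resample(self, area, **kwargs):
        new_scn = Scene(self.filenames, self.reader)
        new_scn.loaded, new_scn.area = list(self.loaded), area
        return new_scn

def scenemaker(filenames, reader=None, datasets=None, area=None, **scene_kwargs):
    """Create a Scene, load datasets, and optionally resample, in one call."""
    scn = Scene(filenames=filenames, reader=reader, **scene_kwargs)
    if datasets:
        scn.load(datasets)
    if area is not None:
        scn = scn.resample(area)
    return scn

scene = scenemaker(["file.nc"], reader="abi_l1b",
                   datasets=["c02", "true_color"], area="iowa")
```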

gerritholl commented Dec 07 '20

In some cases, a user may want the flexibility to calculate a composite from data that didn't come directly from a file. Synthetic example:

sc = Scene()
sc["flash_extent_density"] = xarray.DataArray(np.arange(100).reshape(10, 10), dims=("y", "x"))
sc["C14"] = xarray.DataArray(np.arange(100).reshape(10, 10), dims=("y", "x"))
sc.load(["C14_flash_extent_density"])

the final load fails: even though all the prerequisite data are available, Satpy doesn't know how to calculate the composite. A more realistic case in which this may occur: I have a time series including one ABI file every ten minutes and one GLM file every minute. I want to load flash_extent_density from the GLM files, sum them so they cover ten minutes, then get a MultiScene with one ABI and one GLM dataset per Scene. One approach suggested by @djhoese was to create a MultiScene corresponding to each ABI scene but containing only the ten GLM files, then assign the result of ms.blend(...) to a custom Scene (sc["flash_extent_density"] = ms.blend(...)); but if assigned manually like this, the scene would not know how to calculate composites. (Maybe if I include the ABI file with the multiscene, I can then load composites from the blended scene, but I'm not sure if that would work.)

Incidentally: composites aren't loaded, they're calculated from their dependencies. In the case of composites, the term load is misleading or confusing.

gerritholl commented Jun 30 '21