
[FEATURE] Autodetect and Open files

kmuehlbauer opened this issue 3 months ago • 2 comments

Is your feature request related to a problem? Please describe.

I've downloaded and cropped files in another workflow. Now I want to open such a file through the gpm-api machinery. I'm trying

ds = gpm.open_granule_dataset(myfile, scan_mode="FS")

but this doesn't work, as the checks on the filename (and more) fail for obvious reasons: the filename doesn't include any meaningful information.

Describe the solution you'd like

I'd like gpm-api to be able to load arbitrarily named files from arbitrary locations, detect the needed information from the file's metadata, and acquire all contents into a Dataset/DataTree.

Describe alternatives you've considered

I've locally patched open_granule_dataset to take scan_mode and product as arguments, skipping the checks in that case. But that seems like a fix in the wrong place.

Additional context

This would make gpm-api more versatile by allowing it to open arbitrarily located data, which still has to conform to the GPM standards.

kmuehlbauer avatar Aug 14 '25 10:08 kmuehlbauer

Hey @kmuehlbauer.

I see what you are looking for; indeed, other people (e.g. from the GPM-GV group) have asked me for this as well.

Currently you can open a GPM file (with whatever filename) into a DataTree using open_raw_datatree. However, open_raw_datatree just opens the data and adds dimension names.

With the following code you can obtain what GPM-API would return using gpm.open_* functions:

from gpm.dataset.datatree import open_raw_datatree
from gpm.dataset.granule import _get_scan_mode_dataset
from gpm.dataset.conventions import finalize_dataset

# Open the raw datatree (no filename checks involved)
dt = open_raw_datatree(filepath, **xarray_kwargs)

# Extract scan-mode dataset (flattening out subgroups)
scan_mode = "FS"
ds = _get_scan_mode_dataset(
    dt=dt,
    scan_mode=scan_mode,
)

# Finalize dataset
product = "2A-DPR"  # could be inferred from file metadata ...
ds = finalize_dataset(
    ds=ds,
    product=product,
    scan_mode=scan_mode,
    decode_cf=True,
)
However, I guess you are suggesting to create a function à la open_mfdataset, accepting multiple filepaths and returning a dataset/datatree equivalent to what is returned by the gpm.open_* functions, right?

In that case the various code components are already present. We could just do:

from functools import partial

import xarray as xr
from xarray.backends.api import _multi_file_closer  # private xarray helper (import location assumed)

from gpm.dataset.dataset import _get_scan_modes_datasets_and_closers
from gpm.dataset.conventions import finalize_dataset

scan_modes = ["FS"]  # in future not needed or optional
product = "2A-DPR"  # in future not needed

dict_scan_modes, list_dt_closers = _get_scan_modes_datasets_and_closers(
    filepaths=filepaths,
    parallel=parallel,
    scan_modes=scan_modes,
    decode_cf=False,
    chunks=-1,
    # Custom options
    # variables=variables,
    # groups=groups,
    # prefix_group=prefix_group,
    **xarray_kwargs,
)

# Finalize each scan-mode dataset
dict_scan_modes = {
    scan_mode: finalize_dataset(
        ds=ds,
        product=product,
        scan_mode=scan_mode,
        decode_cf=True,
    )
    for scan_mode, ds in dict_scan_modes.items()
}

# Create datatree
dt = xr.DataTree.from_dict(dict_scan_modes)

# Specify scan-mode closers
for scan_mode, ds in dict_scan_modes.items():
    dt[scan_mode].set_close(ds._close)

# Specify file closers
dt.set_close(partial(_multi_file_closer, list_dt_closers))
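
For context, the multi-file closer used in the last line is essentially just a callable that invokes every stored closer in turn. A minimal pure-Python sketch of the pattern (illustrative, not gpm-api's or xarray's actual implementation):

```python
from functools import partial

def multi_file_closer(closers):
    """Invoke every stored file closer in turn."""
    for close in closers:
        close()

# Dummy closers that record they were called, standing in for file handles
closed = []
closers = [lambda i=i: closed.append(i) for i in range(3)]

# set_close() expects a zero-argument callable, hence the partial
close_all = partial(multi_file_closer, closers)
close_all()
print(closed)  # -> [0, 1, 2]
```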

To avoid users having to specify scan_modes, we should just add if scan_modes is None: <code to retrieve datatree scan modes>.
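
A hedged sketch of that inference step: given the top-level group names of the raw datatree, keep the ones that look like scan modes. The candidate set below is illustrative only; gpm-api presumably has an authoritative per-product list to use instead.

```python
# Illustrative candidate scan modes; the real list would come from gpm-api.
CANDIDATE_SCAN_MODES = {"FS", "HS", "NS", "MS", "S1", "S2", "S3"}

def infer_scan_modes(group_names):
    """Return the top-level groups that look like scan modes."""
    return [name for name in group_names if name in CANDIDATE_SCAN_MODES]

# With a 2A-DPR-like file exposing FS/HS groups plus a metadata group:
print(infer_scan_modes(["FS", "HS", "AlgorithmRuntimeInfo"]))  # -> ['FS', 'HS']
```

In the real function, group_names would come from the opened DataTree's children rather than a hard-coded list.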

Similarly, product could be inferred from the datatree global attributes, but I guess we need to create a dictionary mapping GPM product names to GPM-API product names.
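
A possible sketch of that lookup, assuming the file's FileHeader global attribute carries an AlgorithmID field (e.g. 2ADPR) as GPM HDF5 files typically do; the mapping entries and helper name below are illustrative, not gpm-api API:

```python
# Illustrative mapping from GPM AlgorithmID to GPM-API product names.
ALGORITHM_TO_PRODUCT = {
    "2ADPR": "2A-DPR",
    "2AGMI": "2A-GMI",
}

def infer_product(file_header: str) -> str:
    """Parse FileHeader attribute text and map its AlgorithmID to a product."""
    fields = dict(
        line.split("=", 1)
        for line in file_header.replace(";", "").splitlines()
        if "=" in line
    )
    return ALGORITHM_TO_PRODUCT[fields["AlgorithmID"]]

# Example FileHeader-style text (field values here are made up)
header = "AlgorithmID=2ADPR;\nSatelliteName=GPM;"
print(infer_product(header))  # -> 2A-DPR
```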

Are you willing to draft a PR? Unfortunately, I will certainly not have time to work on that in the coming 2-3 weeks ...

ghiggi avatar Aug 18 '25 08:08 ghiggi

Thanks @ghiggi, I'll work with your small example for now. Let's see if I can come up with something useful. But I have similar time constraints.

kmuehlbauer avatar Aug 19 '25 05:08 kmuehlbauer