[FEATURE] Autodetect and Open files
Is your feature request related to a problem? Please describe.
I've downloaded and cropped files in another workflow. Now I want to open one of these files by piping it through the gpm-api machinery. I'm trying:
ds = gpm.open_granule_dataset(myfile, scan_mode="FS")
but this doesn't work: the checks on the filename (and more) fail for obvious reasons, since the filename doesn't contain any meaningful information.
Describe the solution you'd like
I'd like gpm-api to be able to load arbitrarily named files at arbitrary locations, detect the required information from the file's metadata, and load all contents into a Dataset/DataTree.
Describe alternatives you've considered
I've locally patched open_granule_dataset to take scan_mode and product as arguments, and I skip the checks in those cases. But that seems like a solution in the wrong place.
Additional context
This would make gpm-api more versatile, since it could read arbitrary data as long as that data still conforms to the standards.
Hey @kmuehlbauer.
I see what you are looking for, and indeed other people have asked me for the same thing, e.g. from the GPM-GV group.
Currently you can open a GPM file (with whatever filename) into a DataTree using open_raw_datatree. However, open_raw_datatree only opens the data and adds dimension names.
With the following code you can obtain what GPM-API would return using the gpm.open_* functions:
from gpm.dataset.datatree import open_raw_datatree
from gpm.dataset.granule import _get_scan_mode_dataset
from gpm.dataset.conventions import finalize_dataset
# Open datatree
dt = open_raw_datatree(filepath, **xarray_kwargs)
# Extract scan-mode dataset (flattening-out subgroups)
scan_mode = "FS"
ds = _get_scan_mode_dataset(
    dt=dt,
    scan_mode=scan_mode,
)
# Finalize dataset
product = "2A-DPR"  # could be inferred from file metadata ...
ds = finalize_dataset(
    ds=ds,
    product=product,
    scan_mode=scan_mode,
    decode_cf=True,
)
However, I guess you are suggesting a function à la open_mfdataset that accepts multiple filepaths and returns a dataset/datatree equivalent to what the gpm.open_* functions return, right?
In that case the various code components are already present. We could just do:
from functools import partial
import xarray as xr
from xarray.backends.api import _multi_file_closer  # private xarray helper; location assumed, may vary by version
from gpm.dataset.dataset import _get_scan_modes_datasets_and_closers
from gpm.dataset.conventions import finalize_dataset
scan_modes = ["FS"]  # in future not needed or optional
product = "2A-DPR"  # in future not needed
dict_scan_modes, list_dt_closers = _get_scan_modes_datasets_and_closers(
    filepaths=filepaths,
    parallel=parallel,
    scan_modes=scan_modes,
    decode_cf=False,
    chunks=-1,
    # Custom options
    # variables=variables,
    # groups=groups,
    # prefix_group=prefix_group,
    **xarray_kwargs,
)
# Finalize each scan-mode dataset
dict_scan_modes = {
    scan_mode: finalize_dataset(
        ds=ds,
        product=product,
        scan_mode=scan_mode,
        decode_cf=True,
    )
    for scan_mode, ds in dict_scan_modes.items()
}
# Create datatree
dt = xr.DataTree.from_dict(dict_scan_modes)
# Specify scan modes closers
for scan_mode, ds in dict_scan_modes.items():
    dt[scan_mode].set_close(ds._close)
# Specify files closers
dt.set_close(partial(_multi_file_closer, list_dt_closers))
To spare users from having to specify scan_modes, we should just add here: if scan_modes is None: <code to retrieve datatree scan modes>.
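A minimal sketch of that fallback, assuming the scan modes are simply the top-level groups of the raw datatree (if a file carries other top-level groups, they would need to be filtered out as well). The helper name resolve_scan_modes is hypothetical and not part of GPM-API; it only relies on the object exposing a .children mapping, as xr.DataTree does:

```python
def resolve_scan_modes(dt, scan_modes=None):
    """Return the scan modes to load, defaulting to all top-level groups.

    Hypothetical helper (not part of GPM-API). `dt` only needs a `.children`
    mapping of group name -> node, which xr.DataTree provides.
    """
    available = list(dt.children)
    # No scan modes requested: take everything the file offers
    if scan_modes is None:
        return available
    # Accept a single scan mode given as a plain string
    if isinstance(scan_modes, str):
        scan_modes = [scan_modes]
    # Fail early, with a helpful message, on scan modes the file does not have
    missing = [sm for sm in scan_modes if sm not in available]
    if missing:
        raise ValueError(f"Scan modes {missing} not found. Available: {available}")
    return list(scan_modes)
```

It would be called on the result of open_raw_datatree before the scan-mode datasets are extracted, so the scan_modes argument could become optional.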
Similarly, product could be inferred from the datatree global attributes, but I guess we need to create a dictionary mapping from GPM product names to GPM-API product names.
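Something along these lines could work. It is only a sketch and assumes the file-level FileHeader attribute is a string of KEY=VALUE; entries, one per line, that includes an AlgorithmID key (e.g. 2ADPR). The function names and the single mapping entry are illustrative; the real table would have to be filled in from the list of GPM products:

```python
# Hypothetical mapping from the AlgorithmID found in the file header to
# GPM-API product names. Only one illustrative entry; the full table is TBD.
ALGORITHM_ID_TO_PRODUCT = {
    "2ADPR": "2A-DPR",
}


def parse_file_header(header):
    """Parse a header string made of 'KEY=VALUE;' lines into a dict."""
    info = {}
    for line in header.splitlines():
        line = line.strip().rstrip(";")
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        info[key.strip()] = value.strip()
    return info


def infer_product(header):
    """Return the GPM-API product name for a file header string."""
    algorithm_id = parse_file_header(header).get("AlgorithmID")
    try:
        return ALGORITHM_ID_TO_PRODUCT[algorithm_id]
    except KeyError:
        raise ValueError(
            f"No GPM-API product known for AlgorithmID={algorithm_id!r}"
        ) from None
```

If the raw datatree exposes the header as a global attribute (an assumption on my side), usage would be roughly product = infer_product(dt.attrs["FileHeader"]).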
Are you willing to draft a PR? Unfortunately, I will definitely not have time to work on this in the coming 2-3 weeks ...
Thanks @ghiggi, I'll work with your small example for now. Let's see if I can come up with something useful. But I have similar time constraints.