aospy
Warn or raise when needed DataLoader attributes aren't included
It is possible to instantiate a GFDLDataLoader that doesn't include all of the attributes (e.g. data_dur) that are ultimately required to find the file. Currently, we don't warn or raise when this happens.
This leads to non-intuitive crashes. For example, when I try a calculation using a Run whose data_loader mistakenly lacks a data_dur attribute, I get the following traceback, which stems from data_dur being None:
INFO:root:Getting input data: Var instance "prec_ls" (Thu Feb 2 00:42:23 2017)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/Spencer.Hill/py/scripts/main.py in <module>()
127 if __name__ == '__main__':
128 calcs = main(mp, print_table=mp.print_table, prompt_verify=True,
--> 129 exec_calcs=mp.compute, parallelize=mp.parallelize)
130
131
/home/Spencer.Hill/py/aospy_user/aospy_user/main.pyc in main(main_params, exec_calcs, print_table, prompt_verify, parallelize)
214 else:
215 calcs = cs.create_calcs(param_combos, exec_calcs=exec_calcs,
--> 216 print_table=print_table)
217 return calcs
/home/Spencer.Hill/py/aospy_user/aospy_user/main.pyc in create_calcs(self, param_combos, exec_calcs, print_table)
160 if exec_calcs:
161 try:
--> 162 calc.compute()
163 except RuntimeError as e:
164 logging.warn(repr(e))
/home/Spencer.Hill/py/aospy/aospy/calc.pyc in compute(self, save_files, save_tar_files)
641 """Perform all desired calculations on the data and save externally."""
642 data = self._prep_data(self._get_all_data(self.start_date,
--> 643 self.end_date),
644 self.var.func_input_dtype)
645 logging.info('Computing timeseries for {0} -- '
/home/Spencer.Hill/py/aospy/aospy/calc.pyc in _get_all_data(self, start_date, end_date)
475 end_date, n),
476 self.var.func_input_dtype)
--> 477 for n, var in enumerate(self.variables)]
478
479 def _local_ts(self, *data):
/home/Spencer.Hill/py/aospy/aospy/calc.pyc in _get_input_data(self, var, start_date, end_date, n)
427 data = self.data_loader.load_variable(var, start_date, end_date,
428 self.time_offset,
--> 429 **self.data_loader_attrs)
430 name = data.name
431 data = self._add_grid_attributes(
/home/Spencer.Hill/py/aospy/aospy/data_loader.pyc in load_variable(self, var, start_date, end_date, time_offset, **DataAttrs)
187 """
188 file_set = self._generate_file_set(var=var, start_date=start_date,
--> 189 end_date=end_date, **DataAttrs)
190 ds = _load_data_from_disk(file_set)
191 ds = _prep_time_data(ds)
/home/Spencer.Hill/py/aospy/aospy/data_loader.pyc in _generate_file_set(self, var, start_date, end_date, domain, intvl_in, dtype_in_vert, dtype_in_time, intvl_out)
391 file_set = self._input_data_paths_gfdl(
392 name, start_date, end_date, domain, intvl_in, dtype_in_vert,
--> 393 dtype_in_time, intvl_out)
394 if all([os.path.isfile(filename) for filename in file_set]):
395 return file_set
/home/Spencer.Hill/py/aospy/aospy/data_loader.pyc in _input_data_paths_gfdl(self, name, start_date, end_date, domain, intvl_in, dtype_in_vert, dtype_in_time, intvl_out)
422 name, domain, dtype, intvl_in, year, intvl_out,
423 self.data_start_date.year, self.data_dur))
--> 424 for year in range(start_date.year, end_date.year + 1)]
425 files = list(set(files))
426 files.sort()
/home/Spencer.Hill/py/aospy/aospy/utils/io.pyc in data_name_gfdl(name, domain, data_type, intvl_type, data_yr, intvl, data_in_start_yr, data_in_dur)
153 """Determine the filename of GFDL model data output."""
154 # Determine starting year of netCDF file to be accessed.
--> 155 extra_yrs = (data_yr - data_in_start_yr) % data_in_dur
156 data_in_yr = data_yr - extra_yrs
157 # Determine file name. Two cases: time series (ts) or time-averaged (av).
TypeError: unsupported operand type(s) for %: 'int' and 'NoneType'
I don't want to switch to positional arguments, but I think we should at the very least warn when the needed attributes are missing. Maybe raising is too much, since then a user won't even be able to import their object library -- it might be more user-friendly to warn, so that they can use other objects but also know that this particular object will fail if they try to use it.
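As a rough sketch of the warn-don't-raise approach (names like REQUIRED_ATTRS and warn_if_missing_attrs are illustrative placeholders, not aospy's actual API):

```python
import warnings

# Hypothetical sketch -- REQUIRED_ATTRS and warn_if_missing_attrs are
# placeholder names, not part of aospy's real interface.
REQUIRED_ATTRS = ('data_start_date', 'data_dur')

def warn_if_missing_attrs(loader, required=REQUIRED_ATTRS):
    """Warn, rather than raise, for each required attribute left unset.

    Warning keeps the user's object library importable; the failure only
    surfaces if this particular loader is actually used.
    """
    missing = [name for name in required
               if getattr(loader, name, None) is None]
    for name in missing:
        warnings.warn("GFDLDataLoader attribute {!r} is not set; loading "
                      "data with this loader will fail.".format(name))
    return missing

class _DemoLoader:  # stand-in for a GFDLDataLoader missing data_dur
    data_start_date = 1979
    data_dur = None

missing = warn_if_missing_attrs(_DemoLoader())
```

This would run at instantiation time, so the user sees the warning when building their object library but can still use their other, complete objects.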
Agreed, that is a bad error message
Hey @spencerahill I've been looking at this one today. I have a few prefatory questions.
TLDR - I couldn't get the loader to work with GFDL data from ftp://nomads.gfdl.noaa.gov/gfdl_am2_1/AM2.1_1979-2000-AllForc_h1/
I am not sure if I am getting out-of-scope data (an obscure data format the loader is not designed to work with) or doing something wrong in aospy. Do you have a link to some post-processed GFDL data that is in scope for the GFDL loader?
Details: I downloaded GFDL data from an experimental run of AM 2.1, specifically from ftp://nomads.gfdl.noaa.gov/gfdl_am2_1/AM2.1_1979-2000-AllForc_h1/, and tried to go through the process of loading it.
I tried loading in files for the variable hur, relative_humidity, via https://pcmdi.llnl.gov/ipcc/standard_output14.html#Table_A1a.
Here are the paths that the loader generated and then threw an error on:
[['.../pp/atmos/ts/monthly/5yr/atmos.198001-198412.hur.nc',
'.../pp/atmos/ts/monthly/5yr/atmos.198501-198912.hur.nc',
'.../pp/atmos/ts/monthly/5yr/atmos.199001-199412.hur.nc',
'.../pp/atmos/ts/monthly/5yr/atmos.199501-199912.hur.nc']]
This is pretty close to the format of this GFDL post-processed data, but not quite.
A representative file from this directory was:
.../pp/atmos/ts/monthly/hur_A1.198001-198412.nc
That is, the loader expected:
[rootdir]/[domain]/[dtype_in_time]/[intvl_in]/[data_dur]yr/[domain].[date range].[variable name].nc
and in this dataset, the format seemed to be:
[rootdir]/[domain]/[dtype_in_time]/[intvl_in]/[variable_name][a shorthand version of the IPCC Table identifier].[date range].nc
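As a concrete sketch of the pattern the loader expected (a hypothetical helper, not aospy's actual implementation; the actual root directory is elided here just as in the paths above):

```python
# Illustrative sketch of the GFDLDataLoader's expected path layout.
# expected_gfdl_path is a hypothetical helper, not aospy's real code.
def expected_gfdl_path(rootdir, domain, dtype_in_time, intvl_in,
                       data_dur, date_range, name):
    filename = '{}.{}.{}.nc'.format(domain, date_range, name)
    # [rootdir]/[domain]/[dtype_in_time]/[intvl_in]/[data_dur]yr/[filename]
    return '/'.join([rootdir, domain, dtype_in_time, intvl_in,
                     '{}yr'.format(data_dur), filename])

path = expected_gfdl_path('/pp', 'atmos', 'ts', 'monthly', 5,
                          '198001-198412', 'hur')
# -> '/pp/atmos/ts/monthly/5yr/atmos.198001-198412.hur.nc'
```

This also shows why data_dur matters: it determines the '5yr' directory component, so leaving it unset breaks path construction.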
Am I doing something wrong? (That could totally be what's going on! :) Here is what I tried: https://github.com/haydenbetts/aospy-run-test.) Is this post-processed data from AM 2.1 out of scope for the GFDLDataLoader? If so, where can I find in-scope data?
You're not doing anything wrong; indeed, the filenames of the data you found don't match the pattern we have built the GFDLDataLoader around. Here is another example.
Presumably both of these are model data that was ultimately intended for the CMIP archives, given that they use the CMIP standard variable names rather than GFDL's in-house names, i.e. hur instead of rh (or tas instead of t_surf). Ultimately, this isn't the use case we're interested in; the directory structure and file-naming patterns written into the GFDLDataLoader are the ones used by GFDL's modern in-house models, and those are what we care about.
I'm sure there is some publicly available GFDL data in the proper format, but I can't find any right now; in fact, many of the links on the GFDL data portal page seem broken. @spencerkclark, do you have any on hand?
Also, @spencerkclark wrote the unit tests (e.g. here) that cover the GFDLDataLoader, but AFAICT those tests generate the needed objects and data as they run. So another option is to do something similar. In fact, since this (like all new code) will require unit tests of its own, that ultimately might be the best way to proceed, i.e. you'll have to do it sooner or later.
In other words, maybe don't worry about finding test data in the wild matching the pattern: just construct it like those existing unit tests have.
Also, thanks for describing the problem very clearly; that is really helpful.
@haydenbetts thanks for your interest; sorry for being silent for a bit.
I agree with @spencerahill regarding thinking about this problem in a more abstract sense, i.e. without worrying about having actual example files, as that will be useful for writing tests. In this case I think you might not need to worry about filenames. For testing you might be able to create GFDLDataLoaders with and without the required input arguments and make sure an error is raised under the appropriate circumstances.
Nevertheless, for your reference, I put up a small set of files in the form of a tar archive on Google Drive that fit this directory/naming structure in case you'd like to try things out in a practical setting.