
Warnings about bad input data

Open awst-baum opened this issue 6 years ago • 5 comments

As far as I can see: if there's "bad" input data, pytesmo usually either drops it or issues a (sometimes quite generic) warning. Examples are:

  • pytesmo.validation_framework.data_manager.DataManager.read_ds: warnings are issued, but the underlying exception and sometimes the dataset name and argument information are omitted.
  • pytesmo.temporal_matching.df_match, lines 90-117: if there are no matches between data and reference, an empty (or NaN-filled) DataFrame is returned without any warning.

Is there a general philosophy behind this, like "don't bother the user at all, just give them the results we can produce and let them look into missing or faulty data themselves"?

Since we're currently trying to build a user-friendly web service that uses pytesmo for validations, we'd like to tell the user not only "x% of your input data didn't yield results" but ideally also why that was the case. But that may clash with pytesmo's more Python-developer-oriented approach. Would you be open to us adding more warnings? How much would be too much?

awst-baum avatar Sep 17 '18 14:09 awst-baum

If the dataset reading fails, then only the reader class can issue a specific warning, since pytesmo cannot know why the reading failed. We can of course add the requested gpi (or lon/lat) and the data source name to the pytesmo-level warning.
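
For illustration, a minimal sketch of such a context-rich warning; the helper name and reader interface here are assumptions, not pytesmo's actual code:

```python
import warnings

def read_with_context(reader, name, *args):
    """Call reader.read(*args), adding context to the warning on failure.

    A hypothetical helper; DataManager.read_ds could do the same inline.
    """
    try:
        return reader.read(*args)
    except (IOError, RuntimeError) as e:
        warnings.warn("reading dataset {!r} with arguments {} failed with "
                      "{}: {}".format(name, args, type(e).__name__, e))
        return None
```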

For the temporal matching we can add a warning if no matches are found. Probably in the validation framework, since the temporal matcher does not have all the info to issue a good warning.
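
A possible shape for that check (all names here are illustrative, not existing pytesmo API):

```python
import warnings
import pandas as pd

def warn_if_no_matches(matched, ref_name, other_name):
    """Warn when a temporally matched DataFrame is empty or all-NaN."""
    if matched.dropna(how="all").empty:
        warnings.warn("temporal matching of {!r} against reference {!r} "
                      "produced no matches".format(other_name, ref_name))

# Example: an all-NaN match result triggers the warning.
warn_if_no_matches(pd.DataFrame({"sm": [float("nan")]}), "ISMN", "ASCAT")
```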

I could also imagine a strict mode or something like that which raises an exception for these failures.

A more general question is whether a warning is enough for your purposes. Would you not prefer a results object with more detailed information about the step at which a validation failed?


cpaulik avatar Sep 17 '18 15:09 cpaulik

I could also imagine a strict mode or something like that which raises an exception for these failures.

Might be done with https://docs.python.org/3/library/warnings.html#the-warnings-filter ?
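
A small self-contained example of that idea; NoMatchWarning is a hypothetical category, not something pytesmo currently defines:

```python
import warnings

class NoMatchWarning(UserWarning):
    """Hypothetical category for 'no temporal matches found' warnings."""

# "Strict mode": escalate only this category to an exception,
# leaving all other warnings untouched.
warnings.simplefilter("error", NoMatchWarning)

try:
    warnings.warn("no matches between data and reference", NoMatchWarning)
except NoMatchWarning as e:
    print("caught as exception:", e)
```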

Re results object: I hadn't thought that far. It sounds promising/interesting but may be a major change, right? A tricky part may be storing the results in a netCDF file when they contain error reports as well as result arrays. For the web service, we're looking at both short-term and long-term solutions.

PS: I'm currently playing around in a branch here but haven't done too much yet: https://github.com/awst-austria/pytesmo/tree/verbose_warnings I need to define some unit tests...

awst-baum avatar Sep 17 '18 16:09 awst-baum

Might be done with https://docs.python.org/3/library/warnings.html#the-warnings-filter ?

Yes, that should work fine.

Re results object: I hadn't thought that far. It sounds promising/interesting but may be a major change, right?

Switching to a results object instead of the dictionary we currently use should not be too big of a change. But I could be wrong.
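
A minimal sketch of what such a results object might look like; the field names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    """Hypothetical replacement for the plain results dictionary."""
    metrics: dict = field(default_factory=dict)  # e.g. {"R": 0.8, "RMSD": 0.04}
    error_code: int = 0                          # 0 == validation succeeded
    error_message: str = ""                      # why the validation failed
```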

A tricky part may be storing the results in a netCDF file when they contain error reports as well as result arrays.

We would have to come up with a flagging system where each error has a value. This should then be fairly easy to store according to CF conventions. See http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#flags
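
A sketch of writing such a flag variable with netCDF4, using the flag_values/flag_meanings attributes from the cited CF section; the variable and flag names are invented for illustration:

```python
import numpy as np
import netCDF4

with netCDF4.Dataset("validation_results.nc", "w") as nc:
    nc.createDimension("loc", 4)
    status = nc.createVariable("validation_status", "i1", ("loc",))
    # One flag value per failure category, as in the CF "flags" example.
    status.flag_values = np.array([0, 1, 2, 3], dtype="i1")
    status.flag_meanings = "ok data_read_failed no_temporal_matches scaling_failed"
    status.long_name = "status of the validation at each location"
    status[:] = [0, 0, 2, 1]
```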

cpaulik avatar Sep 18 '18 15:09 cpaulik

And the results object would be put together in pytesmo.validation_framework.validation.Validation.perform_validation?

Of course the trick for creating a netCDF output format would be to foresee the problems that occur and categorise them in a useful fashion (NOT so that all practically occurring issues end up in "other errors"). And then to write a reader/writer for it, I guess?

awst-baum avatar Sep 20 '18 08:09 awst-baum

And the results object would be put together in pytesmo.validation_framework.validation.Validation.perform_validation?

Yes.

Of course the trick for creating a netCDF output format would be to foresee the problems that occur and categorise them in a useful fashion (NOT so that all practically occurring issues end up in "other errors"). And then to write a reader/writer for it, I guess?

For every exception that we have, we can add an error code/value/bit that we then set in the result. The ResultsManager will have to be updated accordingly.
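
A hedged sketch of such a mapping; the exception classes are invented stand-ins for pytesmo's real failure modes:

```python
# Invented exception classes standing in for pytesmo's real failure modes.
class DataReadError(Exception):
    pass

class NoTempMatchError(Exception):
    pass

class ScalingError(Exception):
    pass

# One flag value per exception type; 0 is reserved for success.
ERROR_CODES = {
    DataReadError: 1,
    NoTempMatchError: 2,
    ScalingError: 3,
}

def error_code(exc):
    """Map an exception instance to its flag value (255 = uncategorised)."""
    if exc is None:
        return 0
    return ERROR_CODES.get(type(exc), 255)

print(error_code(None))                # 0
print(error_code(NoTempMatchError()))  # 2
```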

cpaulik avatar Sep 21 '18 13:09 cpaulik