netcdf-c icon indicating copy to clipboard operation
netcdf-c copied to clipboard

Combine multiple logging implementations and make logging available to entire library?

Open edhartnett opened this issue 8 years ago • 26 comments

Summary

There are currently two (about to be three) logging implementations in netCDF, libsrc4, dap, and PIO, each with slightly different features. Yet logging cannot be used in many parts of the library. Shall we combine our logging mechanisms, and make the unified logging available everywhere in the library?

Discussion

NetCDF-4 provides logging. See nc_set_log_level() for some details.

In PIO I also have logging. It's very similar. So it would be nice to combine them.

How about the rest of the library? Any logging? It's quite useful. Users that cannot isolate problems can easily turn on logging, and sometimes that is all you need to solve the problem. Meanwhile, logging is not built by default and so does not cost any performance on production builds.

When I combine libsrc4 and PIO logging, shall I make it available to the rest of the library as well? It could be built into classic instead of netCDF4, and so be more widely available. @WardF and @DennisHeimbigner please let me know your views.

edhartnett avatar Nov 14 '17 01:11 edhartnett

I would like to see this.

gsjaardema avatar Nov 14 '17 01:11 gsjaardema

There is a third system of logging: libdispatch/nclog.c used by the DAP2 and DAP4 code.

DennisHeimbigner avatar Nov 14 '17 02:11 DennisHeimbigner

I think the difference between the dap logging and the libsrc4 logging is the use of unamed levels (0-N) vs loggind kinds: NOTE,WARN,ERR,DEBUG.

DennisHeimbigner avatar Nov 14 '17 02:11 DennisHeimbigner

But the names of levels are just integers ultimately, right? Can you show me an example of using your DAP logging in the code so I can take a look?

The PIO logging has some super useful additional features for HPC:

  • It shows rank of processor at the beginning of log entry.
  • It automatically saves log output to log files, one for each processor.

The files in particular are really useful, because the logging output from even (only) 4 processors running at the same time gets very confusing. With the log files, you run your program, and then see what each processor did in its own file.

Log files would be useful for netCDF-4 logging. Instead of capturing screen output, we just ask user to turn on logging, run their problem code, and send in the log file. One less step for the user.

As I mentioned, I plan on combing the two loggers that I added (netcdf4 and PIO). @DennisHeimbigner, if I can meet all requirements, would you consider switching to use a common logging for the entire library?

This will happen after PIO is released. For the initial release I will just leave the separate loggers in place.

edhartnett avatar Nov 14 '17 14:11 edhartnett

@gsjaardema can you expand on why you would like this feature? What uses to you foresee?

HPC applications are (ironically) often written and debugged without debuggers. (I use gdb, but it's certainly more challenging in multi-processor situations.)

So logging can have a greater than expected usefulness to HPC users.

edhartnett avatar Nov 14 '17 14:11 edhartnett

It would be useful to me primarily for HPC/parallel debugging or tracing. On large numbers of procesosrs it is not always easy to use a debugger. Being able to look at the logging output from multiple processors makes it easier to see that they are all doing what is expected.

Also useful for serial runs just to make sure everything is following the correct path -- netcdf, pnetcdf, netcdf4, ...

gsjaardema avatar Nov 14 '17 17:11 gsjaardema

@gsjaardema can I ask in general how you do I/O now from 100K+ cores?

You use pnetcdf of course, but do you write directly from each core?

Or do you gather your data on a subset of cores are do all disk I/O from there?

edhartnett avatar Nov 14 '17 18:11 edhartnett

  1. Look at libdap2/ncd2dispatch.c for instances of nclog().
  2. But these are only of limited use in general. I* t shows rank of processor at the beginning of log entry.
    • It automatically saves log output to log files, one for each processor.
  3. But the names of levels are just integers ultimately, right? True, but the meanings are entirely different. You log integers are all levels of debug info. There is no equivalent of the others: warn, info, etc.

DennisHeimbigner avatar Nov 14 '17 19:11 DennisHeimbigner

Extra HPC featues would only be on for parallel processing systems.

I'm confident I can map the other loggers to your system of warn, info, etc.

When the time comes I will see if I can come up with a good combined logger and you can see what you think. The goal would be to keep compatibility with all existing logging calls, so I don't have to go through and change them everywhere.

Anyway, let's shelve this for a while, until after PIO is released. ;-)

edhartnett avatar Nov 14 '17 19:11 edhartnett

[moved here from issue https://github.com/Unidata/netcdf-c/issues/1040] One thing missing from the netcdf-c library is the general ability to report more detailed execution information. For example, if a routine returns the NC_EPERM error code (access error), there is currently no general way to tell the user what the name of the file for which access failed. A library-wide logging system would help mitigate this deficiency. By enabling logging, a user could get more information about, for example, why a netcdf-c API call failed (and maybe even where).

This issue is intended to come to some agreement about how to provide a unified logging system for the netcdf-c library. The proposal here is basically to provide two systems: a trace system and a log system. The trace system would basically be the existing log system in libsrc4. The log system would be the one used in libdap2+libdap4.

Currently, two logging facilities exist in the netcdfd-c library. One is in the libsrc4 directory and other is in the libdap2 and libdap4 directories.

The two logging systems are different in terms of how they are controlled. In libsrc4, a log request is associated with a "level" number: -1 to 5, where -1 means no logging. If the global level is less than or equal to the level number associated with a given log statement, then that log statement is executed and a log message is output.

In libdap2 and libdap4, the log statement has an associated ordered "class" of message: e.g. (in order) ERROR, WARN, INFO, DEBUG. The classes of messages to be output is determined by a global log level that is one of the classes. If the level is set to a given class, then all log messages at that level or lower will be output, otherwise suppressed. For example, if the log level is INFO, then all log statements tagged ERROR, WARN, or INFO will be output and log statements tagged DEBUG will be suppressed. This logging system is closely patterned on the Java logging system.

In practice, the libsrc4 logging system is used almost exclusively as a trace system rather than logging per-se. Thus, as each procedure is entered, a log/trace statement is executed to report on the procedure and its arguments. Occasionally, log statements are used to report error or warning information, but it is not the norm.

An additional feature for a logging system is to control where the log output is to be sent. For both of the netcdf-c logging systems, stderr is the default. But both can be directed elsewhere (typically to some file).

As an aside, it would be worth noting HDF5 error reporting system and (separate) trace system.

DennisHeimbigner avatar Jul 05 '18 16:07 DennisHeimbigner

Maybe the trace system (nc_set_log_level() and the LOG(()) macro) should remain separate from the error logging mechanism?

Or do we regard all trace statements as different levels of INFO log statements?

Seems like with error logging, what we are talking about is putting in a string for each error, instead of what we have now.

So where we have:

if ((retval = soime_func()))
   BAIL(NC_EINVAL);

We would instead have something like:

if ((retval = soime_func()))
   BAIL(NC_EINVAL, "failed trying to call some func");

And the error string would end up in the log, and (depending on some setting) also printed to stderr. This is how HDF5 handles it.

So when there is an error, you get the nested set of error strings to help figure out the problem.

Dennis is this what you have in mind?

Here's a more concrete example, from nc_def_var()

   /* Check and normalize the name. */
   if ((retval = nc4_check_name(name, norm_name)))
      BAIL(retval, "name failed normalization or checks");

   /* Not a Type is, well, not a type.*/
   if (xtype == NC_NAT)
      BAIL(NC_EBADTYPE, "NAT is not a type");

   /* For classic files, only classic types are allowed. */
   if (h5->cmode & NC_CLASSIC_MODEL && xtype > NC_DOUBLE)
      BAIL(NC_ESTRICTNC3, "For classic files, only classic types are allowed");

   /* For classic files */
   if (h5->cmode & NC_CLASSIC_MODEL && ndims > NC_MAX_VAR_DIMS)
      BAIL(NC_EMAXDIMS, "Too many dims for a classic model file");

edhartnett avatar Jul 05 '18 22:07 edhartnett

We have some options for this. The simplest is, I think, this proposal.

  1. We re-define the log levels to be, in effect, an enumeration with values ERROR, WARN, INFO, DEBUG, DEBUG1, DEBUG2, DEBUG3, DEBUG4, DEBUG5
  2. The existing nc_set_log_level function is deprecated, but is define so that nc_set_log_level(n) => DEBUGn and where 0 => use level DEBUG.
  3. We define new log level get/set function names that utilize the enumeration as argument.
  4. We modify the libsrc4 code LOG calls that are actually logging as opposed to tracing. My guess is that any LOG call with a level of zero is actually a true log operations. My hypothesis is that the number of true log calls is very small compared to the number of trace calls. This means that the amount of code modifications in libsrc4 would be correspondingly small.

If we want to later consider providing separate trace and log functions, then we could define a new trace level set/get that did what the current nc_set_log_level function did.

DennisHeimbigner avatar Jul 06 '18 19:07 DennisHeimbigner

Here are all the LOG 0 statements in the code. Not many at all, compared to the total number of times we return an error code:

ed@mikado:~/tmp/netcdf-c$ find . -name '*.c'|xargs grep 'LOG((0'
./libhdf5/nc4hdf.c:      LOG((0, "Unknown class: %d", type->nc_type_class));
./libhdf5/nc4hdf.c:                  LOG((0, "no dimscale found!"));
./libhdf5/nc4hdf.c:         LOG((0, "%s: Unknown destination type.", __func__));
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:         LOG((0, "%s: unexpected dest type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:      LOG((0, "%s: unexpected src type. src_type %d, dest_type %d",
./libhdf5/nc4hdf.c:      LOG((0,"Type = %s(%lld) name='%s'",typename,printid,name));
./libhdf5/nc4hdf.c:      LOG((0,"\nReport: open objects on %lld",(long long)fid));
./libhdf5/hdf5file.c:         LOG((0, msg));
./libhdf5/hdf5file.c:      LOG((0, "unknown class"));
./libhdf5/hdf5file.c:      LOG((0, "Unknown object class %d in %s!", oinfo.statbuf.type, __func__));
./libhdf5/hdf5internal.c:      LOG((0, "Couldn't turn off HDF5 error messages!"));
./libhdf5/hdf5internal.c:         LOG((0, "H5Eset_auto failed!"));
./libsrc4/nc4internal.c:   /*       LOG((0, "H5Eset_auto failed!")); */
./libsrc4/nc4internal.c:         LOG((0, "Unknown class: %d", type->nc_type_class));
./nc_test4/tst_h_vl2.c:		     LOG((0, "unknown datatype class"));
./nc_test4/tst_h_vl2.c:	       LOG((0, "Unknown object class %d in nc4_rec_read_metadata!",

Some important differences between error logging and tracing:

  • users almost always want errors but almost never want tracing.
  • trace output is very, very large for most real-world netcdf applications, error logs would be very small.
  • trace statements incur a performance penalty when used, error log statements do not, because they are only called in error conditions, not in normal operation.

So whatever you are proposing should accommodate those differences.

We should not just switch everything to the same function, and always call it, and sometimes not do any output. This is too much of a performance hit for a trace. We really want the trace statements to completely disappear in normal builds - we don't want to force a context switch, and then do an evaluation, and then another context switch. The current LOG() macro accomplishes this, and this capability should be retained.

Are you proposing that we make the LOG 0 calls into error log calls, but leave all the LOG 1-5 as they are, so that they will disappear from the code if --enable-logging is not used? I would be in favor of that.

I would also be in favor of a modified BAIL macro which added a string to the error log; we could then start converting to the new BAIL macro and that would provide the user with more information in error conditions.

And there would also need to be a function (like H5set_auto()) which allows the user to turn dumps of the error log on and off. Possibly also some function to read messages from the error log. Also a maximum message size, probably NC_MAX_NAME.

edhartnett avatar Jul 06 '18 19:07 edhartnett

The more I consider it, the more I think that tracing and error logging are two different things which should not be combined.

However, each can be improved: the trace statements by allowing per-task file output, and the error logging by allowing error messages to be pushed on the error stack each time the code returns an error code.

edhartnett avatar Jul 06 '18 20:07 edhartnett

Good point about cost of tracing. That means we need two systems: one for logging and one for tracing. That means also that to avoid major code upheaval in libsrc4, we are temporarily going to have to:

  1. make LOG() call the trace function
  2. for those LOG() statements that are not doing tracing, we will need to modify them by hand to directly call the (new) log function.
  3. As you note, the new trace mechanism needs to have its set of TRACE() macros than can be conditionally included in the code. So we will need to pieces of global state: log level and a trace level plus corresponding set/get function. But one problem. The log level INFO might take up significant compute time so we need to figure out a way to deal with that. One way is to ensure that the arguments to the log call do not require significant computation (e.g. only pass variables and constants).

DennisHeimbigner avatar Jul 06 '18 20:07 DennisHeimbigner

Why not continue to use the current LOG() mechanism for tracing. (We can change it to TRACE() to make it more clear.) It does not need to call the error logging function.

Then modify the LOG 0 statements, by hand, to call the error logging function instead of LOG().

Then provide a new version of BAIL which accepts a string, and adds it to the error log when bailing out of a function.

edhartnett avatar Jul 06 '18 21:07 edhartnett

BTW I have a massive refactor of libsrc4/libhdf5 under way to do lazy read of vars. So I would like to apply these changes myself, after the lazy read work is complete and merged, if that's OK.

edhartnett avatar Jul 06 '18 21:07 edhartnett

The reason to temporarily hijack LOG() macro in libsrc4 (and libhdf5?) was to avoid having to do a mass edit of most of the files in libsrc4+libhdf5. But I suppose if we do it as a single PR, it would be ok.

DennisHeimbigner avatar Jul 06 '18 21:07 DennisHeimbigner

As I say, as long as its an isolated single PR.

DennisHeimbigner avatar Jul 06 '18 21:07 DennisHeimbigner

What is to be the task in the large edit?

1 - To change LOG 0 to error log calls. 2 - To change LOG to TRACE but otherwise leave it alone. 3 - To add a BAIL macro that takes an error message, wherever we currently just return an error code.

edhartnett avatar Jul 06 '18 21:07 edhartnett

That sounds right. It is possible that there are LOG calls with level > 0 that are actually error reports, but we can deal with those later.

DennisHeimbigner avatar Jul 06 '18 21:07 DennisHeimbigner

OK,also we can add that we want the trace to output to a file, or multiple files for multi-processor situations.

I will make these changes after I have submitted the lazy vars change for review, and they are merged.

edhartnett avatar Jul 06 '18 21:07 edhartnett

The logging code in libdispatch (used in libdap2 and libdap4) supports both logging as we are now defining it and also supports control for directing the output to e.g. a file or stderr. You might want to start with that.

DennisHeimbigner avatar Jul 06 '18 21:07 DennisHeimbigner

OK, will do!

edhartnett avatar Jul 06 '18 23:07 edhartnett

Another thing. We also need to decide on the ./configure flag(s) for this. A more general issue is back compatibility. My guess is that very few users run with logging enabled, but we need to make a reasonable accomodation for that case.

DennisHeimbigner avatar Jul 07 '18 18:07 DennisHeimbigner

OK, I'm returning to this issue.

Since this was started, I've changed the logging (that is, the tracing) of netCDF-4 to better handle MPI. Now each processor will produce it's own log file.

I've also added documentation for logging.

What remains to do is to make the improvements to error handling mentioned in this issue. I will ponder those...

edhartnett avatar Aug 14 '24 14:08 edhartnett

I think this is unlikely to change in the near future. I will close this issue.

edhartnett avatar Sep 10 '25 12:09 edhartnett