pooch icon indicating copy to clipboard operation
pooch copied to clipboard

Change how we handle verbosity and logging

Open leouieda opened this issue 2 years ago • 6 comments

The current implementation uses the logging module to print messages about downloads and processing operations. This has a few drawbacks that we didn't expect when it was first implemented:

  1. We have verbosity on by default by setting a handler for the logger, which is against recommendations for libraries
  2. Pooch is a library meant to be used in other libraries (pooch.Pooch) but also in end-user code (pooch.retrieve) in notebooks/scripts. For libraries, silent Pooch by default makes more sense but when using a notebook default verbosity is probably better since we let users know about downloads, processing, and hashes.
  3. If a library changes the Pooch logging handler (to silence or explicitly enable it), all of their dependencies will have the default Pooch behaviour changed without warning.
  4. Disabling verbosity requires knowing about the logging module.

A better implementation would be to have a verbose keyword argument in Pooch and retrieve that can be used to generate a printer object (like the rich.console.Console) object that is passed on to the downloaders and processors. Or even a global pooch printer that can be configured by a function (I'd rather avoid having mutable global state, though, since it's a nightmare to test). The existing functionality can be maintained having pooch.get_logger return a mock of the Logger object that modifies our printer instead.

This way:

  • Libraries will be properly isolated from each other
  • Setting verbosity is not global can requires only a single keyword argument change
  • Backwards compatibility is maintained by setting verbose=True by default and making get_logger return something that can be used as a global verbosity setting

Or we just give up backward compatibility and release Pooch 2.0.0 (with a deprecation warning in a previous release).

Feedback from users on this would be much appreciated, particularly if Pooch is one of your dependencies.

See #301 #232

leouieda avatar Feb 09 '22 11:02 leouieda

xarray (I think) and regionmask would likely welcome this change.

A proper deprecation could be a bit of a pain. You could check if get_logger().level != default and issue a warning if not? (And supress output if get_logger().level > 20?)

mathause avatar Feb 09 '22 12:02 mathause

I'd also welcome this change in icepack.

danshapero avatar Feb 09 '22 17:02 danshapero

IMO, a much simpler start would be to remove this line:

https://github.com/fatiando/pooch/blob/07336ef0f5e6d774c0b1cd33ef1c49e1f924a20d/pooch/utils.py#L24

By default, the logging module is setup with a lastResort handler that, in the absence of any other handler, sends all WARNING or higher messages to stderr.

I don't have skin in this game, I'm content to roll with whatever Pooch does by default. I'll just say that logging frameworks exist so that all the API calls in a system don't need to grow arguments to control output. I'm also not personally convinced that logging is somehow inadequate here.

dopplershift avatar Feb 10 '22 19:02 dopplershift

Thanks for the feedback all.

I'll just say that logging frameworks exist so that all the API calls in a system don't need to grow arguments to control output. I'm also not personally convinced that logging is somehow inadequate here.

I agree that logging is great and it makes implementing this sort of thing much easier. But I think the use case that logging is designed for is not how we want to use it. From my understanding, logging is meant to be used:

  • In a library, call logging setting the null handler (i.e., silent)
  • Application developers set the handlers they want for the libraries they want
  • End-users don't write code and see the logging messages that the application developers want them to see
  • The application code is not used to build further applications downstream

For this, the recommend of not setting a handler in library code makes perfect sense. And by having singleton loggers, application developers can set handlers for the entire stack.

But Pooch is used in mainly 2 ways:

  1. Library developers use Pooch. Their users are mostly 1) other library developers downstream (like xarray -> metpy) or 2) people writing scripts/notebooks using the sample data (usually tutorials or docs).
  2. Directly by people writing scripts and notebooks (through pooch.retrieve). This is probably where logging messages are most useful because these people know they are using Pooch.

For 1. if any of the libraries set a handler, then they influence everyone else in the stack in strange ways. For example, if scikit-image adds a handler and metpy doesn't, an end user will see logging messages for scikit-image and metpy in their notebook if they import both but not if only importing metpy. So the only ones who should be setting handlers are the end-users writing the scripts/notebooks during a tutorial or following the docs. They probably won't since the whole idea is that they don't even need to know Pooch exists.

For 2. it would be a bad experience to have to import the logging module and add a handler just to get a message printed when calling retrieve. So something like retrieve(url, hash, verbose=True) makes sense here.

Which then brings us to:

By default, the logging module is setup with a lastResort handler that, in the absence of any other handler, sends all WARNING or higher messages to stderr.

Removing that line would mean Pooch is silent by default (we mostly issue INFO level messages). So we might as well set a NullHandler as the logging docs suggest. If any end-user really cares, they can set the handler (most won't). We would need to warn library devs to not set a handler so they don't influence other libraries. But this would mean that there is no way for libraries to control the verbosity of Pooch and most of the messages will likely never be seen.

I'm OK with that but I'm hesitant to make the change and piss people off who want a way to enable/disable this. I don't see a way of delivering that other than the implementation I outlined above. I know this is probably not a big deal and most people won't care either way.

leouieda avatar Feb 11 '22 10:02 leouieda

Just to support the point - xarray sets the pooch log-level when tutorial data is downloaded:

    logger = pooch.get_logger()
    logger.setLevel("WARNING")

Setting the log-level for other users was certainly unintentional, could be fixed easily, and does not have huge consequences, but shows that it's easy to get wrong.

mathause avatar Feb 11 '22 10:02 mathause

I came across this in the following use case:

  1. Our application code calls geodatasets which calls pooch
  2. pooch issues the "Downloading" message to the console
  3. Our code runs on Google Cloud in such a way that all console messages are captured and displaying in a GCP logging site
  4. For the logs to work right on that GCP site, they need to be formatted as JSON, otherwise they get tagged as an "error"

In our application code, we handle this by setting the root logger to have a console handler that formats the message as JSON.

But due to the line suggested to be removed in this comment: https://github.com/fatiando/pooch/issues/302#issuecomment-1035432931 the message doesn't get formatted.

My suggested solution is as follows:

Replace LOGGER.addHandler(logging.StreamHandler()) in pooch with:

if len(logging.getLogger().handlers) == 0:
    LOGGER.addHandler(logging.StreamHandler()) 

Then if the user hasn't created any handlers for the root logger, pooch will add its handler, which I think covers the use cases for people using notebooks.

But if a user has added handlers, or a library has added handlers to the root logger, then pooch output will just use those handlers.

I think this addresses your use cases, but not 100% sure.

Dr-Irv avatar Nov 06 '23 21:11 Dr-Irv