pypdf icon indicating copy to clipboard operation
pypdf copied to clipboard

DOC: How to handle / customize log records emitted by pypdf

Open stefan6419846 opened this issue 11 months ago • 8 comments

Explanation

According to https://github.com/py-pdf/pypdf/blob/11ee6480a3f795d770da89944f32a977e3c110e2/pypdf/_utils.py#L433-L449

logger_warning is advertised to allow overwriting, but doing so proves to be more complicated than expected.

There basically are two reasons for this:

  1. The implementation is located at pypdf._utils.logger_warning, which usually marks an internal implementation and thus I would not assume this to be future-proof.
  2. The method usually is imported directly, as can be seen in https://github.com/py-pdf/pypdf/blob/11ee6480a3f795d770da89944f32a977e3c110e2/pypdf/filters.py#L46 for example. This requires patching each import separately.

stefan6419846 avatar Jul 26 '23 08:07 stefan6419846

The overwriting was meant for testing, so rather for contributors to pypdf than pure users, if I remember correctly.

From a users perspective, this should be good:

import logging
from pypdf import PdfReader


logger = logging.getLogger("pypdf")
logger.setLevel(logging.ERROR)  # <--- if you comment this line out, you will see a warning
    
pdf_path = "sample-files/017-unreadable-meta-data/unreadablemetadata.pdf"
reader = PdfReader(pdf_path)

The warning you will see:

Invalid parent xref., rebuild xref

Does that solve the issue?

MartinThoma avatar Jul 28 '23 20:07 MartinThoma

The overwriting was meant for testing, so rather for contributors to pypdf than pure users, if I remember correctly.

Thanks for the explanations. For me it sounds like this indeed might be a user-callable configuration.

From a users perspective, this should be good:

While this might be a solution, logger_warning has the advantage that I can more easily change the actual log level. In this specific use case, I do not consider the warnings I have seen from PyPDF as actual warnings to be analyzed, so I am currently using logger_warning to reduce the log level to info. With this, I do not skip any debugging information on my side, but reduce the noise in monitoring where everything below the warning level gets ignored.

stefan6419846 avatar Jul 29 '23 05:07 stefan6419846

I don't understand the issue here. What are you doing that the following doesn't do?

import logging
logger = logging.getLogger("pypdf")
logger.setLevel(logging.WARNING)

MartinThoma avatar Jul 29 '23 05:07 MartinThoma

I am doing the following for the different logger_warning occurrences:

import logging
from contextlib import contextmanager

from pypdf import _reader, PdfReader


def custom_logger_warning(msg, src):
    logging.getLogger(src).info(msg)


@contextmanager
def reduce_log_level():
    old_logger_warning = _reader.logger_warning
    _reader.logger_warning = custom_logger_warning
    yield
    _reader.logger_warning = old_logger_warning


with reduce_log_level():
    reader = PdfReader('file.pdf')

This basically allows me to temporarily reduce the log level for the PyPDF warnings. Why? Logging is enabled including the info level, while monitoring filters everything to start at the warning level. Info messages can be analyzed in the original logfiles if required for some reasons, while keeping the focus on the important stuff in monitoring.

For some use cases (background workers which receive lots of broken/strange files) the warning level is too high as the PDF can apparently be processed in some way. Real exceptions are more urgent in these cases, as they might indicate an actual bug.

stefan6419846 avatar Jul 29 '23 17:07 stefan6419846

I'm sorry, I still don't understand. Your code does not reduce the log-level threshold of the system, but reduce the log level being used.

Couldn't you just do the following to reduce the log level threshold?

import logging
from contextlib import contextmanager

from pypdf import PdfReader


def custom_logger_warning(msg, src):
    logging.getLogger(src).info(msg)


@contextmanager
def reduce_log_level():
    logger = logging.getLogger("pypdf")
    old_level = logger.level
    logger.setLevel(logging.INFO)
    yield
    logger.setLevel(old_level)


with reduce_log_level():
    reader = PdfReader('file.pdf')

Logging is enabled including the info level, while monitoring filters everything to start at the warning level. Info messages can be analyzed in the original logfiles if required for some reasons, while keeping the focus on the important stuff in monitoring.

That part makes me think that you want to manipulate the log records themselves, e.g:

import logging
from pypdf import PdfReader

class NewLogger(logging.Logger):
    def makeRecord(self, *args, **kwargs):
        if len(args) >= 2:
            args = list(args)
            args[1] = logging.INFO
        if 'level' in kwargs:
            kwargs['level'] = logging.INFO
        return super(NewLogger, self).makeRecord(*args, **kwargs)

logging.basicConfig()
logging.setLoggerClass(NewLogger)

pdf_logger = logging.getLogger('pypdf')

# Now, all log messages from pypdf will be converted to the "INFO" level.

# Test
pdf_logger.info("This will be captured in the log.")
pdf_logger.warning("This will also be captured in the log.")
pdf_logger.error("This will also be captured in the log.")


# Especially
pdf_path = "sample-files/017-unreadable-meta-data/unreadablemetadata.pdf"
reader = PdfReader(pdf_path)

you can apply those two patterns at the same time, of course, to make a decorator.

MartinThoma avatar Jul 31 '23 17:07 MartinThoma

Your code does not reduce the log-level threshold of the system, but reduce the log level being used.

This is indeed correct and desired. As logger_warning promotes such a customization behavior inside its docstring, I assumed that this would indeed be a supported feature out-of-the-box. If I understand it correctly, the docs should be adjusted instead to make it more clear who is intended to monkey-patch logger_warning.

That part makes me think that you want to manipulate the log records themselves

This is true and the proposed plain logging.Logger solution would be a general-purpose alternative, yes.

stefan6419846 avatar Aug 01 '23 06:08 stefan6419846

If I understand it correctly, the docs should be adjusted instead to make it more clear who is intended to monkey-patch logger_warning

Yes, that would be good :-) And I think we should add more examples (like the ones I gave) to our docs. We have a lot of first-time developers who struggle with this and think it indicates a bug.

MartinThoma avatar Aug 03 '23 06:08 MartinThoma

@2HR3Y It does not really sound like you have read the previous discussions and maybe just threw this into some ML model. (1) does not work as logger_warning is not part of the class, (2) will not work as due to using from pypdf._utils import logger_warning basically requires monkey-patching all final import paths.

The actual underlying issue is that the docs tend to be unclear and should be adjusted to better indicate these things.

stefan6419846 avatar Feb 12 '24 19:02 stefan6419846