pypdf
pypdf copied to clipboard
DOC: How to handle / customize log records emitted by pypdf
Explanation
According to https://github.com/py-pdf/pypdf/blob/11ee6480a3f795d770da89944f32a977e3c110e2/pypdf/_utils.py#L433-L449
logger_warning
is advertised to allow overwriting, but doing so proves to be more complicated than expected.
There basically are two reasons for this:
- The implementation is located at
pypdf._utils.logger_warning
, which usually marks an internal implementation and thus I would not assume this to be future-proof. - The method usually is imported directly, as can be seen in https://github.com/py-pdf/pypdf/blob/11ee6480a3f795d770da89944f32a977e3c110e2/pypdf/filters.py#L46 for example. This requires patching each import separately.
The overwriting was meant for testing, so rather for contributors to pypdf than pure users, if I remember correctly.
From a users perspective, this should be good:
import logging
from pypdf import PdfReader
logger = logging.getLogger("pypdf")
logger.setLevel(logging.ERROR) # <--- if you comment this line out, you will see a warning
pdf_path = "sample-files/017-unreadable-meta-data/unreadablemetadata.pdf"
reader = PdfReader(pdf_path)
The warning you will see:
Invalid parent xref., rebuild xref
Does that solve the issue?
The overwriting was meant for testing, so rather for contributors to pypdf than pure users, if I remember correctly.
Thanks for the explanations. For me it sounds like this indeed might be a user-callable configuration.
From a users perspective, this should be good:
While this might be a solution, logger_warning
has the advantage that I can more easily change the actual log level. In this specific use case, I do not consider the warnings I have seen from PyPDF as actual warnings to be analyzed, so I am currently using logger_warning
to reduce the log level to info
. With this, I do not skip any debugging information on my side, but reduce the noise in monitoring where everything below the warning
level gets ignored.
I don't understand the issue here. What are you doing that the following doesn't do?
import logging
logger = logging.getLogger("pypdf")
logger.setLevel(logging.WARNING)
I am doing the following for the different logger_warning
occurrences:
import logging
from contextlib import contextmanager
from pypdf import _reader, PdfReader
def custom_logger_warning(msg, src):
logging.getLogger(src).info(msg)
@contextmanager
def reduce_log_level():
old_logger_warning = _reader.logger_warning
_reader.logger_warning = custom_logger_warning
yield
_reader.logger_warning = old_logger_warning
with reduce_log_level():
reader = PdfReader('file.pdf')
This basically allows me to temporarily reduce the log level for the PyPDF warnings. Why? Logging is enabled including the info level, while monitoring filters everything to start at the warning level. Info messages can be analyzed in the original logfiles if required for some reasons, while keeping the focus on the important stuff in monitoring.
For some use cases (background workers which receive lots of broken/strange files) the warning level is too high as the PDF can apparently be processed in some way. Real exceptions are more urgent in these cases, as they might indicate an actual bug.
I'm sorry, I still don't understand. Your code does not reduce the log-level threshold of the system, but reduce the log level being used.
Couldn't you just do the following to reduce the log level threshold?
import logging
from contextlib import contextmanager
from pypdf import PdfReader
def custom_logger_warning(msg, src):
logging.getLogger(src).info(msg)
@contextmanager
def reduce_log_level():
logger = logging.getLogger("pypdf")
old_level = logger.level
logger.setLevel(logging.INFO)
yield
logger.setLevel(old_level)
with reduce_log_level():
reader = PdfReader('file.pdf')
Logging is enabled including the info level, while monitoring filters everything to start at the warning level. Info messages can be analyzed in the original logfiles if required for some reasons, while keeping the focus on the important stuff in monitoring.
That part makes me think that you want to manipulate the log records themselves, e.g:
import logging
from pypdf import PdfReader
class NewLogger(logging.Logger):
def makeRecord(self, *args, **kwargs):
if len(args) >= 2:
args = list(args)
args[1] = logging.INFO
if 'level' in kwargs:
kwargs['level'] = logging.INFO
return super(NewLogger, self).makeRecord(*args, **kwargs)
logging.basicConfig()
logging.setLoggerClass(NewLogger)
pdf_logger = logging.getLogger('pypdf')
# Now, all log messages from pypdf will be converted to the "INFO" level.
# Test
pdf_logger.info("This will be captured in the log.")
pdf_logger.warning("This will also be captured in the log.")
pdf_logger.error("This will also be captured in the log.")
# Especially
pdf_path = "sample-files/017-unreadable-meta-data/unreadablemetadata.pdf"
reader = PdfReader(pdf_path)
you can apply those two patterns at the same time, of course, to make a decorator.
Your code does not reduce the log-level threshold of the system, but reduce the log level being used.
This is indeed correct and desired. As logger_warning
promotes such a customization behavior inside its docstring, I assumed that this would indeed be a supported feature out-of-the-box. If I understand it correctly, the docs should be adjusted instead to make it more clear who is intended to monkey-patch logger_warning
.
That part makes me think that you want to manipulate the log records themselves
This is true and the proposed plain logging.Logger
solution would be a general-purpose alternative, yes.
If I understand it correctly, the docs should be adjusted instead to make it more clear who is intended to monkey-patch logger_warning
Yes, that would be good :-) And I think we should add more examples (like the ones I gave) to our docs. We have a lot of first-time developers who struggle with this and think it indicates a bug.
@2HR3Y It does not really sound like you have read the previous discussions and maybe just threw this into some ML model. (1) does not work as logger_warning
is not part of the class, (2) will not work as due to using from pypdf._utils import logger_warning
basically requires monkey-patching all final import paths.
The actual underlying issue is that the docs tend to be unclear and should be adjusted to better indicate these things.