pdfplumber Feature: Warn on Unicode decoding errors in PDF annotations

Annotations sometimes contain invalid or extraneous data that cannot be decoded, which raises a UnicodeDecodeError and halts annotation processing. To mitigate this, the warn_unicode_error parameter in the PDF initializer and the .open() method lets you skip these errors and emit a warning instead, so the rest of the PDF can still be processed.

Example warning, if activated:

UserWarning: Could not decode contents for annotation. Annotation contents will be missing.

Aug 30 '24 08:08 stolarczyk

Hi @stolarczyk, and thank you for this suggestion. It makes sense to provide such warnings, although I'd lean toward a more generalizable approach rather than specifying parameters for each type of warning. To that end, I'm more inclined to use Python's built-in warning filtering. I'm open to other opinions, though. What do you think?
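For reference, here is a minimal sketch of what the built-in filtering looks like, using only the standard-library `warnings` module. The `decode_annotation` function is a hypothetical stand-in for any library call that emits a `UserWarning`; it is not part of pdfplumber.

```python
import warnings


def decode_annotation(raw: bytes) -> str:
    """Hypothetical stand-in for a library call that emits a UserWarning."""
    warnings.warn("Could not decode contents for annotation.", UserWarning)
    return ""


def count_warnings(action: str) -> int:
    """Run the stand-in under one filter action and count recorded warnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        decode_annotation(b"\xff")
    return len(caught)


def escalates_to_error() -> bool:
    """Under the "error" action, the same warning is raised as an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            decode_annotation(b"\xff")
        except UserWarning:
            return True
    return False


shown = count_warnings("always")   # the warning is recorded
hidden = count_warnings("ignore")  # the warning is suppressed
```

With filters, the caller decides whether a warning is shown, hidden, or escalated, without a dedicated parameter per warning type. Filters only act on warnings that are actually emitted; they cannot downgrade an exception.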

Sep 09 '24 20:09 jsvine

Thanks for looking into this, @jsvine!

I’m not entirely clear on how your idea regarding built-in warning filtering would address the issue I'm focusing on. The proposed change turns an exception (UnicodeDecodeError) into a warning, which prevents the PDF processing from crashing entirely. So the warning is the result, not the issue at hand.
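To illustrate the distinction, here is a rough sketch of the behavior. The `decode_contents` helper and the UTF-8 assumption are hypothetical and only illustrate the idea; this is not the actual pdfplumber code.

```python
import warnings
from typing import Optional


def decode_contents(raw: bytes, warn_unicode_error: bool = False) -> Optional[str]:
    """Hypothetical helper: decode annotation bytes as UTF-8.

    When warn_unicode_error is True, a decoding failure produces a
    UserWarning and returns None instead of propagating the
    UnicodeDecodeError to the caller.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if not warn_unicode_error:
            # Default behavior: let the exception stop processing.
            raise
        warnings.warn(
            "Could not decode contents for annotation. "
            "Annotation contents will be missing.",
            UserWarning,
        )
        return None
```

Without the flag, `decode_contents(b"\xff")` raises; with `warn_unicode_error=True`, it warns and returns `None`, so the remaining annotations can still be processed.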

Sep 10 '24 09:09 stolarczyk

My apologies for the misunderstanding! I think the name of the proposed parameter threw me off, but I also should have looked more closely. I think I understand it now. This proposal makes sense, though what about tweaking the name to raise_unicode_errors=True/False?

Oct 03 '24 02:10 jsvine

Thanks for the suggestion. Just renamed it.

Nov 13 '24 17:11 stolarczyk

Thanks, @stolarczyk. I've pushed a small tweak, above, so that the linter is happy. But it looks like we're missing a bit of test coverage.

Nov 22 '24 13:11 jsvine

@jsvine, the test has been added.

Nov 22 '24 15:11 stolarczyk

Thanks! Merged into develop.

Dec 09 '24 04:12 jsvine