pdfminer.six icon indicating copy to clipboard operation
pdfminer.six copied to clipboard

You keep using that word `cast`. I do not think it means what you think it means 😀

Open dhdaines opened this issue 1 year ago • 1 comments

pdfinterp.py is full of code like this:

    def do_G(self, gray: PDFStackT) -> None:
        """Set gray level for stroking operations"""
        self.graphicstate.scolor = cast(float, gray)
        self.scs = self.csmap["DeviceGray"]

It appears that the intent here (which would be logical to a Java programmer, for instance) is to ensure that the object in question is really a float, coercing it if possible, and throwing an exception if not.

Otherwise, various other code down the line will inevitably throw some other, possibly less obvious, exception. But also, it means that in the case where an object has a union type, e.g. Color:

Color = Union[
    float,  # Greyscale
    Tuple[float, float, float],  # R, G, B
    Tuple[float, float, float, float],  # C, M, Y, K
]

one could (except one cannot, see below) reliably check at runtime which of the possible values it is.

But that's not what typing.cast does! It is actually type assertion (like as in TypeScript) - it says to mypy, "I know this is a float so quit complaining that it isn't". It does nothing at runtime at all.

This is a longstanding issue for some users of pdfminer.six, for example: https://github.com/jsvine/pdfplumber/issues/917#issuecomment-1615259362

It also turns out to be the source of some fuzz errors, since invalid or corrupted PDFs can easily have objects of the wrong type, and instead of causing a PDFSyntaxError or PDFValueError this leads to some other exception which is not caught.

dhdaines avatar Sep 19 '24 15:09 dhdaines

~Note that Color of 4 values could also be RGBA, but that's beside the point ;-)~

(this is wrong, there is no RGBA ... it could be DeviceN though)

dhdaines avatar Sep 19 '24 15:09 dhdaines

As someone who identifies as a Python programmer, it does hurt to be recognized as a Java one :smile:

These casts were introduced to satisfy mypy when we introduced mypy type checking. Since the data flow in pdfminer.six is not always as explicit as mypy needs it to be.

Ideally we restructure the code so that these casts are no longer needed. I changed the title to reflect that.

pietermarsman avatar Nov 26 '24 18:11 pietermarsman

As someone who identifies as a Python programmer, it does hurt to be recognized as a Java one 😄

My apologies ;-) not referring to you specifically!

Ideally we restructure the code so that these casts are no longer needed. I changed the title to reflect that.

In many cases they can simply be changed to num_value...

dhdaines avatar Nov 27 '24 13:11 dhdaines

If this is specific to pdfinterp.py, maybe this is fixed by https://github.com/pdfminer/pdfminer.six/pull/1100? That replaces all the casts with explict conversions, and warnings if that fails.

pietermarsman avatar Mar 30 '25 12:03 pietermarsman

If this is specific to pdfinterp.py, maybe this is fixed by #1100? That replaces all the casts with explict conversions, and warnings if that fails.

Yes, this will definitely solve most of the problems - for me it's an open question whether it's preferable to return None or just let Python raise a TypeError and catch it (possibly converting it to a PDFSyntaxError) further down the call stack, but the important thing is just to ensure that type annotations actually have some meaning.

dhdaines avatar Mar 30 '25 14:03 dhdaines