You keep using that word `cast`. I do not think it means what you think it means 😀
pdfinterp.py is full of code like this:
def do_G(self, gray: PDFStackT) -> None:
"""Set gray level for stroking operations"""
self.graphicstate.scolor = cast(float, gray)
self.scs = self.csmap["DeviceGray"]
It appears that the intent here (which would be logical to a Java programmer, for instance) is to ensure that the object in question is really a float, coercing it if possible, and throwing an exception if not.
Otherwise, various other code down the line will inevitably throw some other, possibly less obvious, exception. But also, it means that in the case where an object has a union type, e.g. Color:
Color = Union[
float, # Greyscale
Tuple[float, float, float], # R, G, B
Tuple[float, float, float, float], # C, M, Y, K
]
one could (except one cannot, see below) reliably check at runtime which of the possible values it is.
But that's not what typing.cast does! It is actually type assertion (like as in TypeScript) - it says to mypy, "I know this is a float so quit complaining that it isn't". It does nothing at runtime at all.
This is a longstanding issue for some users of pdfminer.six, for example: https://github.com/jsvine/pdfplumber/issues/917#issuecomment-1615259362
It also turns out to be the source of some fuzz errors, since invalid or corrupted PDFs can easily have objects of the wrong type, and instead of causing a PDFSyntaxError or PDFValueError this leads to some other exception which is not caught.
~Note that Color of 4 values could also be RGBA, but that's beside the point ;-)~
(this is wrong, there is no RGBA ... it could be DeviceN though)
As someone who identifies as a Python programmer, it does hurt to be recognized as a Java one :smile:
These casts were introduced to satisfy mypy when we introduced mypy type checking. Since the data flow in pdfminer.six is not always as explicit as mypy needs it to be.
Ideally we restructure the code so that these casts are no longer needed. I changed the title to reflect that.
As someone who identifies as a Python programmer, it does hurt to be recognized as a Java one 😄
My apologies ;-) not referring to you specifically!
Ideally we restructure the code so that these
casts are no longer needed. I changed the title to reflect that.
In many cases they can simply be changed to num_value...
If this is specific to pdfinterp.py, maybe this is fixed by https://github.com/pdfminer/pdfminer.six/pull/1100? That replaces all the casts with explict conversions, and warnings if that fails.
If this is specific to pdfinterp.py, maybe this is fixed by #1100? That replaces all the casts with explict conversions, and warnings if that fails.
Yes, this will definitely solve most of the problems - for me it's an open question whether it's preferable to return None or just let Python raise a TypeError and catch it (possibly converting it to a PDFSyntaxError) further down the call stack, but the important thing is just to ensure that type annotations actually have some meaning.