PyMuPDF
Rectangle detection can be incorrect, producing wrong output
Description
In one PDF document, I noticed incorrect behavior (for one specific path) after loading and then writing the PDF. (I know that PyMuPDF does not, in general, guarantee round-tripping because, e.g., clipping is sadly not supported. But that is not the issue here.) I was able to track it down to what looks like incorrect rectangle detection.
To Reproduce (mandatory)
In order to reproduce the behavior, please take a look at the minimal example pdf I provided. The relevant part of the pdf trace is the following path:
<stroke_path ... transform="1 0 0 -1 0 0">
<moveto x="80.879" y="-37.141"/>
<lineto x="3567.422" y="-37.141"/>
<lineto x="3567.422" y="-2304.777"/>
<lineto x="747" y="-2304.777"/>
<lineto x="747" y="-1326.84"/>
<lineto x="80.879" y="-1326.84"/>
<closepath/>
<moveto x="80.879" y="-37.141"/>
</stroke_path>
This path draws a closed shape that looks like a rotated letter "L".
However, when I try to load the pdf document with PyMuPDF, the path is handled incorrectly:
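import fitz  # PyMuPDF
demo_pdf = "path to file"  # placeholder: path to the minimal example PDF
doc = fitz.open(demo_pdf)
page = next(doc.pages())  # first page of the document
page.get_drawings()  # extract the vector drawings ("line art") of the page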
This produces the following, incorrect, output:
[{
'items': [
('l', Point(80.87899780273438, 37.14099884033203), Point(3567.422119140625, 37.14099884033203)),
('l', Point(3567.422119140625, 37.14099884033203), Point(3567.422119140625, 2304.777099609375)),
('re', Rect(80.87899780273438, 1326.8399658203125, 3567.422119140625, 2304.777099609375), -1)
],
'type': 's',
'stroke_opacity': 1.0,
'color': (0.0, 0.0, 0.0),
'width': 0.11999999731779099,
'lineCap': (1, 1, 1),
'lineJoin': 1.0,
'closePath': False,
...
}]
Clearly, the detected rectangle is wrong: it does not appear in the original PDF document.
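The mismatch can be checked by hand using only the coordinates quoted in this report. The following is a minimal sketch, not part of PyMuPDF: the helper names `rect_edges` and `same_segment` and the tolerance are made up for illustration. It decomposes the reported rectangle into its four edges and looks for each one among the segments of the original path.

```python
# Points of the original path, taken from the trace above
# (y values written as positive numbers, as in the get_drawings() output).
path_points = [
    (80.879, 37.141),
    (3567.422, 37.141),
    (3567.422, 2304.777),
    (747.0, 2304.777),
    (747.0, 1326.84),
    (80.879, 1326.84),
]
# closepath adds one more segment from the last point back to the first.
original_segments = list(zip(path_points, path_points[1:] + path_points[:1]))


def rect_edges(x0, y0, x1, y1):
    """Return the four edges of an axis-aligned rectangle as point pairs."""
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return list(zip(corners, corners[1:] + corners[:1]))


def same_segment(a, b, tol=1e-3):
    """True if two segments share both end points, in either direction."""
    def close(p, q):
        return abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol
    return (close(a[0], b[0]) and close(a[1], b[1])) or (
        close(a[0], b[1]) and close(a[1], b[0])
    )


# Rectangle reported by get_drawings() in the output above.
reported_rect = (80.879, 1326.84, 3567.422, 2304.777)

# Edges of the reported rectangle that have no counterpart in the original path.
unmatched = [
    edge
    for edge in rect_edges(*reported_rect)
    if not any(same_segment(edge, seg) for seg in original_segments)
]
print(len(unmatched), "of 4 rectangle edges are not segments of the original path")
```

None of the four edges coincides end to end with a segment of the original path, although some of them overlap original segments in part.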
Your configuration (mandatory)
- Ubuntu 22.04 64 bit
- Python 3.10.4
- PyMuPDF 1.19.6
Additional context (optional)
(The behavior is the same in PyMuPDF 1.20.2.)
The general problem here is that the "device" interface (which is used for "line art" extraction) never contains atomic drawing operations other than lines and curves. For convenience, my code tries to regenerate higher-order objects like rectangles and quads. This obviously must be based on a few assumptions / heuristics. What I am using for rectangles is this:
- At the end of a path, I look back and check whether I encountered "eligible" lines.
- Eligible means 3 or 4 connected lines in a row, which are parallel to the x- and the y-axis in the right, alternating way.
- More than one rectangle in the same path can be identified in this way.
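To illustrate how a look-back heuristic of this kind can misfire, here is a deliberately naive sketch. It is not PyMuPDF's actual code, and the thread does not say whether this is the bug that was eventually fixed; all function names are hypothetical. It accepts any run of three connected, alternating axis-parallel lines and takes their bounding box, without checking that the two parallel sides have equal length.

```python
def is_horizontal(seg, tol=1e-6):
    return abs(seg[0][1] - seg[1][1]) <= tol


def is_vertical(seg, tol=1e-6):
    return abs(seg[0][0] - seg[1][0]) <= tol


def naive_rect_candidates(segments):
    """Return (start_index, bbox) for each run of 3 connected, alternating
    axis-parallel lines. Naive on purpose: side lengths are never compared."""
    found = []
    for i in range(len(segments) - 2):
        a, b, c = segments[i:i + 3]
        connected = a[1] == b[0] and b[1] == c[0]
        alternating = (
            is_horizontal(a) and is_vertical(b) and is_horizontal(c)
        ) or (is_vertical(a) and is_horizontal(b) and is_vertical(c))
        if connected and alternating:
            xs = [p[0] for s in (a, b, c) for p in s]
            ys = [p[1] for s in (a, b, c) for p in s]
            found.append((i, (min(xs), min(ys), max(xs), max(ys))))
    return found


def opposite_sides_match(a, c, tol=1e-6):
    """In a true rectangle, the 1st and 3rd line are equal in length and
    opposite in direction."""
    dx_a, dy_a = a[1][0] - a[0][0], a[1][1] - a[0][1]
    dx_c, dy_c = c[1][0] - c[0][0], c[1][1] - c[0][1]
    return abs(dx_a + dx_c) <= tol and abs(dy_a + dy_c) <= tol


# The five explicit lines of the L-shaped path from the report.
points = [
    (80.879, 37.141),
    (3567.422, 37.141),
    (3567.422, 2304.777),
    (747.0, 2304.777),
    (747.0, 1326.84),
    (80.879, 1326.84),
]
segments = list(zip(points, points[1:]))

candidates = naive_rect_candidates(segments)
for start, bbox in candidates:
    ok = opposite_sides_match(segments[start], segments[start + 2])
    print(start, bbox, "valid rectangle" if ok else "NOT a rectangle")
```

On the path from the report, the naive scan accepts three runs, and the bounding box of the last one is exactly the rectangle shown in the incorrect output. The additional side-length check rejects all three.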
Your example shows where this heuristic may fail. So, what can be done about it? Possible alternative behaviors are:
- Stop detecting higher-level objects altogether - only return lines and curves, no rectangles, no quads.
- Only resurrect rectangles if the path contains exactly 3 orthogonal lines and nothing else. This implies no longer resurrecting multiple rectangles in the same path. It also means no longer detecting quads, because that code is an integral part of rect detection.
- Offer an option in the method, e.g. page.get_drawings(simple=True) or similar, which would cause the behavior in option 1.
That is about all that can be done, I believe. @julian-smith-artifex-com - do you have a position on this?
Thanks (as always) for the extremely fast response! I am sorry, I believe I have not made the point clear enough. We are very happy with the heuristics for detecting composite shapes from the lines in a PDF path, but in this example, PyMuPDF introduces lines that were not there before. It would be fairly easy to check if the rectangle is decomposed into lines again and all lines were compared to what was there before. I'll add images illustrating the (to me incorrect) behavior.
original:
imported:
> It would be fairly easy to check if the rectangle is decomposed into lines again and all lines were compared to what was there before.
No, I'm afraid it wouldn't.
The only possibility I see is being more restrictive with shape detection:
- Rectangles only if the path contains exactly 3 lines and finishes with a "closePath" instruction.
- Quads only if the path contains exactly 4 lines where end point equals start point.
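The two stricter rules can be sketched as a small classifier. This is an illustration under stated assumptions, not the code that ended up in PyMuPDF: `classify_path` and `orientation` are hypothetical names, a path is assumed to be a list of line segments given as point pairs, and the check that the rectangle's sides are axis-parallel is an added assumption carried over from the heuristic described earlier.

```python
def orientation(seg, tol=1e-6):
    """Return "h" or "v" for an axis-parallel segment, else None."""
    (x0, y0), (x1, y1) = seg
    if abs(y0 - y1) <= tol and abs(x0 - x1) > tol:
        return "h"
    if abs(x0 - x1) <= tol and abs(y0 - y1) > tol:
        return "v"
    return None


def classify_path(segments, close_path):
    """Return "rect", "quad" or "lines" under the stricter rules."""
    segments = list(segments)
    n = len(segments)
    # All lines must be connected end to start.
    if any(segments[i][1] != segments[i + 1][0] for i in range(n - 1)):
        return "lines"
    # Rectangle: exactly 3 lines plus closePath, alternating h/v all around.
    if n == 3 and close_path:
        closing = (segments[2][1], segments[0][0])
        kinds = [orientation(s) for s in segments + [closing]]
        if kinds in (["h", "v", "h", "v"], ["v", "h", "v", "h"]):
            return "rect"
    # Quad: exactly 4 lines whose end point equals the start point.
    if n == 4 and segments[3][1] == segments[0][0]:
        return "quad"
    return "lines"


square = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1))]
quad = [((0, 0), (2, 0)), ((2, 0), (3, 1)), ((3, 1), (1, 1)), ((1, 1), (0, 0))]
l_points = [
    (80.879, 37.141),
    (3567.422, 37.141),
    (3567.422, 2304.777),
    (747.0, 2304.777),
    (747.0, 1326.84),
    (80.879, 1326.84),
]
l_shape = list(zip(l_points, l_points[1:]))

print(classify_path(square, True))    # 3 lines + closePath
print(classify_path(quad, False))     # 4 lines, end point equals start point
print(classify_path(l_shape, True))   # the 5-line path from the report
```

Under these rules the L-shaped path from the report has too many lines to qualify as either shape, so it would be returned as plain lines.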
I finally found a bug in the existing code, which makes my previous comments obsolete.
So definitely thanks for the well-prepared report! The issue will be resolved in the next version.
That is very good news. Thanks a lot! And, as always, thank you for your very fast and always insightful responses!
Fixed in 1.21.0