Rectangle detection can be incorrect producing wrong output

Open griai opened this issue 2 years ago • 7 comments

Description

In one PDF document, I noticed incorrect behavior (affecting one specific path) after loading and then writing the PDF. (I know that PyMuPDF does not guarantee round-tripping in general because, e.g., clipping is sadly not supported. But that is not the issue here.) I was able to track it down to a possibly incorrect rectangle detection.

To Reproduce (mandatory)

In order to reproduce the behavior, please take a look at the minimal example pdf I provided. The relevant part of the pdf trace is the following path:

    <stroke_path ... transform="1 0 0 -1 0 0">
        <moveto x="80.879" y="-37.141"/>
        <lineto x="3567.422" y="-37.141"/>
        <lineto x="3567.422" y="-2304.777"/>
        <lineto x="747" y="-2304.777"/>
        <lineto x="747" y="-1326.84"/>
        <lineto x="80.879" y="-1326.84"/>
        <closepath/>
        <moveto x="80.879" y="-37.141"/>
    </stroke_path>

This path draws a closed shape resembling a rotated letter "L".

However, when I try to load the pdf document with PyMuPDF, the path is handled incorrectly:

    import fitz
    demo_pdf = "path to file"
    doc = fitz.open(demo_pdf)
    page = next(doc.pages())
    page.get_drawings()

This produces the following incorrect output:

[{
  'items': [
    ('l', Point(80.87899780273438, 37.14099884033203), Point(3567.422119140625, 37.14099884033203)),
    ('l', Point(3567.422119140625, 37.14099884033203), Point(3567.422119140625, 2304.777099609375)),
    ('re', Rect(80.87899780273438, 1326.8399658203125, 3567.422119140625, 2304.777099609375), -1)
  ],
  'type': 's',
  'stroke_opacity': 1.0,
  'color': (0.0, 0.0, 0.0),
  'width': 0.11999999731779099,
  'lineCap': (1, 1, 1),
  'lineJoin': 1.0,
  'closePath': False,
  ...
}]

Clearly, the detected rectangle is wrong: it does not appear in the original PDF document.
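To make the mismatch concrete, here is a minimal sketch that compares the edges of the reported rectangle against the segments of the original path. It is plain Python and does not call PyMuPDF; the helper names (`rect_edges`, `covers`), the tolerance, and the rounding of coordinates to three decimals are assumptions made for illustration.

```python
TOL = 1e-3  # coordinate tolerance, chosen arbitrarily for this sketch

# Vertices of the path from the trace, with y negated by the
# "1 0 0 -1 0 0" transform and values rounded as in the trace.
path_points = [
    (80.879, 37.141),
    (3567.422, 37.141),
    (3567.422, 2304.777),
    (747.0, 2304.777),
    (747.0, 1326.84),
    (80.879, 1326.84),
]
# closepath joins the last vertex back to the first one.
original_segments = list(zip(path_points, path_points[1:] + path_points[:1]))


def rect_edges(x0, y0, x1, y1):
    """Four edges of an axis-parallel rectangle as point pairs."""
    return [
        ((x0, y0), (x1, y0)),
        ((x1, y0), (x1, y1)),
        ((x1, y1), (x0, y1)),
        ((x0, y1), (x0, y0)),
    ]


def covers(outer, inner):
    """True if axis-parallel segment `inner` lies entirely on `outer`."""
    for axis in (0, 1):  # 0: shared x (vertical), 1: shared y (horizontal)
        const = outer[0][axis]
        if all(abs(p[axis] - const) <= TOL for p in (*outer, *inner)):
            other = 1 - axis
            lo, hi = sorted(p[other] for p in outer)
            return all(lo - TOL <= p[other] <= hi + TOL for p in inner)
    return False


# Rectangle reported by get_drawings(), rounded to three decimals.
reported = (80.879, 1326.84, 3567.422, 2304.777)

phantom_edges = [
    edge for edge in rect_edges(*reported)
    if not any(covers(seg, edge) for seg in original_segments)
]
print(len(phantom_edges), "of 4 rectangle edges are not part of the original path")
```

Under these assumptions, only the right edge of the reported rectangle lies on a segment of the original path; the other three edges extend beyond it.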

Your configuration (mandatory)

  • Ubuntu 22.04 64 bit
  • Python 3.10.4
  • PyMuPDF 1.19.6

Additional context (optional)

f71ad3_reduced_uncompressed.pdf

griai avatar Sep 22 '22 10:09 griai

(The behavior is the same in PyMuPDF 1.20.2.)

griai avatar Sep 22 '22 10:09 griai

The general problem here is that the "device" interface (which is used for "line art" extraction) never contains atomic drawing operations other than lines and curves. For convenience, my code tries to regenerate higher-order objects like rectangles and quads. This obviously must be based on a few assumptions / heuristics. What I am using for rectangles is this:

  • At the end of a path, I look back and check whether I encountered "eligible" lines.
  • Eligible means 3 or 4 connected lines in a row that are parallel to the x- and y-axes in the right, alternating way.
  • More than one rectangle in the same path can be identified in this way.
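The look-back described in these bullets could be sketched roughly as follows. This is an illustration only, not PyMuPDF's code: the function names, the tolerance, and the `check_closing_side` switch are all hypothetical. The switch exists only to show how much depends on verifying that the implied fourth side actually closes the shape.

```python
TOL = 1e-3  # coordinate tolerance, chosen arbitrarily for this sketch


def is_horizontal(seg):
    return abs(seg[0][1] - seg[1][1]) <= TOL < abs(seg[0][0] - seg[1][0])


def is_vertical(seg):
    return abs(seg[0][0] - seg[1][0]) <= TOL < abs(seg[0][1] - seg[1][1])


def connected(a, b):
    """True if segment b starts where segment a ends."""
    return abs(a[1][0] - b[0][0]) <= TOL and abs(a[1][1] - b[0][1]) <= TOL


def trailing_rect(lines, check_closing_side=True):
    """Look back at the last three lines of a path.

    Returns (x0, y0, x1, y1) if they look like three sides of an
    axis-parallel rectangle, else None.
    """
    if len(lines) < 3:
        return None
    a, b, c = lines[-3:]
    if not (connected(a, b) and connected(b, c)):
        return None
    if is_horizontal(a) and is_vertical(b) and is_horizontal(c):
        closes = abs(c[1][0] - a[0][0]) <= TOL  # closing side would be vertical
    elif is_vertical(a) and is_horizontal(b) and is_vertical(c):
        closes = abs(c[1][1] - a[0][1]) <= TOL  # closing side would be horizontal
    else:
        return None
    if check_closing_side and not closes:
        return None
    xs = [p[0] for seg in (a, b, c) for p in seg]
    ys = [p[1] for seg in (a, b, c) for p in seg]
    return (min(xs), min(ys), max(xs), max(ys))


# The five lineto segments of the path from the issue (y negated, rounded).
pts = [(80.879, 37.141), (3567.422, 37.141), (3567.422, 2304.777),
       (747.0, 2304.777), (747.0, 1326.84), (80.879, 1326.84)]
issue_lines = list(zip(pts, pts[1:]))

loose = trailing_rect(issue_lines, check_closing_side=False)
strict = trailing_rect(issue_lines)
print(loose, strict)
```

With the closing-side check disabled, the sketch returns the bounding box (80.879, 1326.84, 3567.422, 2304.777), which has the same coordinates as the rectangle in the reported output; with the check enabled, nothing is detected. The thread does not say what the actual bug was, so this is only one plausible way such a look-back can misfire.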

Your example shows where these heuristics may fail. So, what can be done about it? Possible alternative behaviors are:

  1. Stop detecting higher-level objects altogether - only return lines and curves, no rectangles, no quads.
  2. Only resurrect rectangles if the path contains exactly 3 orthogonal lines and nothing else. This implies no longer resurrecting multiple rectangles in the same path. Also stop detecting quads, because that code is an integral part of rect detection.
  3. Offer an option in method page.get_drawings(simple=True) or similar, which would cause the behavior in option 1.

That is about all that can be done, I believe. @julian-smith-artifex-com - do you have a position on this?

JorjMcKie avatar Sep 22 '22 11:09 JorjMcKie

Thanks (as always) for the extremely fast response! I am sorry, I believe I did not make the point clear enough. We are very happy with the heuristics for detecting composite shapes from the lines in a PDF path, but in this example, PyMuPDF introduces lines that were not there before. It would be fairly easy to check if the rectangle is decomposed into lines again and all lines were compared to what was there before. I'll add images illustrating the (to me incorrect) behavior.

griai avatar Sep 22 '22 11:09 griai

original: f71ad3_reduced_uncompressed

imported: f71ad3_reduced_uncompressed_imported

griai avatar Sep 22 '22 12:09 griai

It would be fairly easy to check if the rectangle is decomposed into lines again and all lines were compared to what was there before.

No, I'm afraid it wouldn't.

The only possibility I see is being more restrictive with shape detection:

  1. Rectangles only if the path contains exactly 3 lines and finishes with a "closePath" instruction.
  2. Quads only if the path contains exactly 4 lines where end point equals start point.
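A minimal sketch of what these two stricter rules might look like when applied to a whole path. It is not PyMuPDF's implementation: `classify_path`, the tolerance, and the extra requirement that all four rectangle sides (including the implicit closing one) are axis-parallel and alternate are assumptions added for illustration.

```python
TOL = 1e-3  # coordinate tolerance, chosen arbitrarily for this sketch


def same(p, q):
    return abs(p[0] - q[0]) <= TOL and abs(p[1] - q[1]) <= TOL


def orientation(p, q):
    """'h' or 'v' for a non-degenerate axis-parallel side, else None."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    if dy <= TOL < dx:
        return "h"
    if dx <= TOL < dy:
        return "v"
    return None


def classify_path(lines, close_path):
    """Return 'rect', 'quad' or 'lines' for one path made only of lines."""
    # Every line must start where the previous one ended.
    if not all(same(a[1], b[0]) for a, b in zip(lines, lines[1:])):
        return "lines"
    points = [lines[0][0]] + [seg[1] for seg in lines]
    if len(lines) == 3 and close_path:
        ring = points + [points[0]]  # add the implicit closing side
        sides = [orientation(p, q) for p, q in zip(ring, ring[1:])]
        if sides in (["h", "v", "h", "v"], ["v", "h", "v", "h"]):
            return "rect"
    if len(lines) == 4 and same(points[-1], points[0]):
        return "quad"
    return "lines"


# The five lineto segments of the path from the issue (y negated, rounded).
pts = [(80.879, 37.141), (3567.422, 37.141), (3567.422, 2304.777),
       (747.0, 2304.777), (747.0, 1326.84), (80.879, 1326.84)]
issue_result = classify_path(list(zip(pts, pts[1:])), close_path=True)
print(issue_result)
```

Under these rules, the path from the issue has five lines and is therefore left as plain lines, at the cost of no longer recognizing rectangles embedded in longer paths.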

JorjMcKie avatar Sep 22 '22 13:09 JorjMcKie

I finally found a bug in the existing code, which makes my previous comments obsolete.

So definitely thanks for the well-prepared report! The issue will be resolved in the next version.

JorjMcKie avatar Sep 22 '22 14:09 JorjMcKie

That is very good news. Thanks a lot! And, as always, thank you for your very fast and always insightful responses!

griai avatar Sep 22 '22 15:09 griai

Fixed in 1.21.0