
`get_drawings`'s `items` is missing line from `h` path operator

Open Rodrigodd opened this issue 11 months ago • 2 comments

Description of the bug

I have a PDF file with the following two squares, each one formed by two triangles:

q 1 0 0 1 300 100 cm
0 0 m
100 0 l
0 100 l
h
100 0 m
0 100 l
100 100 l
h
f
Q

q 1 0 0 1 100 100 cm
0 0 m
100 0 l
0 100 l
0 0 l
h
100 0 m
0 100 l
100 100 l
100 0 l
h
f
Q

The only difference in the second shape is the presence of an extra l operator before each h operator, which makes the h redundant. The h operator is described in section 8.5.2.1 of PDF 32000-1:2008: it closes the current subpath by appending a straight line from the current point to the subpath's starting point.
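To make the expected behavior concrete, here is a minimal sketch of how the m, l and h operators can be expanded into explicit line segments. It is not PyMuPDF code: `expand_path` and the tuple encoding of the operators are made up for this illustration, and skipping the zero-length closing segment is a simplification that is harmless for filled paths.

```python
# Hypothetical helper, not part of PyMuPDF: expand the PDF path operators
# 'm', 'l' and 'h' into explicit line segments (start_point, end_point).
def expand_path(ops):
    segments = []
    start = current = None
    for op in ops:
        kind = op[0]
        if kind == "m":            # begin a new subpath
            start = current = op[1]
        elif kind == "l":          # straight line from the current point
            segments.append((current, op[1]))
            current = op[1]
        elif kind == "h":          # close: line back to the subpath start
            if current != start:   # skip the zero-length segment if already there
                segments.append((current, start))
            current = start
        else:
            raise ValueError(f"unsupported operator: {kind}")
    return segments


# The first triangle of each square, as written in the content stream.
first_triangle = [("m", (0, 0)), ("l", (100, 0)), ("l", (0, 100)), ("h",)]
second_triangle = first_triangle[:3] + [("l", (0, 0)), ("h",)]

# Both variants describe the same three edges.
print(expand_path(first_triangle) == expand_path(second_triangle))
```

Under this reading, an extractor that reports explicit segments would be expected to return three lines per triangle in both cases.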

When rendered, the two shapes are equal:

[image: rendering of the two squares]

But when running Page.get_drawings on this document, the items of the first drawing are missing the lines drawn by the h operator:

{'closePath': True,
 'color': None,
 'dashes': None,
 'even_odd': False,
 'fill': (0.9176470041275024, 0.8980389833450317, 0.8235290050506592),
 'fill_opacity': 1.0,
 'items': [('l', Point(400.0, 200.0), Point(300.0, 200.0)),
           ('l', Point(300.0, 200.0), Point(400.0, 100.0)),
           ('l', Point(300.0, 200.0), Point(400.0, 100.0)),
           ('l', Point(400.0, 100.0), Point(300.0, 100.0))],
 'layer': '',
 'lineCap': None,
 'lineJoin': None,
 'rect': Rect(300.0, 100.0, 400.0, 200.0),
 'seqno': 0,
 'stroke_opacity': None,
 'type': 'f',
 'width': None}
{'closePath': True,
 'color': None,
 'dashes': None,
 'even_odd': False,
 'fill': (0.9176470041275024, 0.8980389833450317, 0.8235290050506592),
 'fill_opacity': 1.0,
 'items': [('l', Point(200.0, 200.0), Point(100.0, 200.0)),
           ('l', Point(100.0, 200.0), Point(200.0, 100.0)),
           ('l', Point(200.0, 100.0), Point(200.0, 200.0)),
           ('l', Point(100.0, 200.0), Point(200.0, 100.0)),
           ('l', Point(200.0, 100.0), Point(100.0, 100.0)),
           ('l', Point(100.0, 100.0), Point(100.0, 200.0))],
 'layer': '',
 'lineCap': None,
 'lineJoin': None,
 'rect': Rect(100.0, 100.0, 200.0, 200.0),
 'seqno': 1,
 'stroke_opacity': None,
 'type': 'f',
 'width': None}

How to reproduce the bug

Run the following script:

import fitz
import sys
from pprint import pprint

pdf_path = sys.argv[1]

pdf_document = fitz.open(pdf_path)

for page in pdf_document:
    for draw in page.get_drawings():
        pprint(draw)

with the following sample file:

testtriangles.pdf

I created the sample file by manually editing another file and fixing the stream length with qpdf, so don't be surprised if there is anything wrong with it.

PyMuPDF version

1.23.25

Operating system

Windows

Python version

3.12

Rodrigodd, Feb 27 '24 16:02

I looked a little more into this, and I think the problem is, at the very least, a lack of documentation.

I ran the example code from the Drawing and Graphics guide on the sample file, and it still produced the expected PDF:

import fitz
from pprint import pprint

doc = fitz.open("testtriangles.pdf")
page = doc[0]
paths = page.get_drawings()  # extract existing drawings

outpdf = fitz.open()
outpage = outpdf.new_page(width=page.rect.width, height=page.rect.height)
shape = outpage.new_shape()  # make a drawing canvas for the output page

for path in paths:
    pprint(path)
    for item in path["items"]:  # these are the draw commands
        if item[0] == "l":  # line
            shape.draw_line(item[1], item[2])
        elif item[0] == "re":  # rectangle
            shape.draw_rect(item[1])
        elif item[0] == "qu":  # quad
            shape.draw_quad(item[1])
        elif item[0] == "c":  # curve
            shape.draw_bezier(item[1], item[2], item[3], item[4])
        else:
            raise ValueError("unhandled drawing", item)
    shape.finish(
        fill=path["fill"],  # fill color
        color=path["color"],  # line color
        dashes=path["dashes"],  # line dashing
        even_odd=path.get("even_odd", True),  # control color of overlaps
        closePath=path["closePath"] or True,  # whether to connect last and first point
        lineJoin=path["lineJoin"] or 0,  # how line joins should look like
        lineCap=path["lineCap"] or 0,  # how line ends should look like
        width=path["width"],  # line width
        stroke_opacity=path.get("stroke_opacity", 1) or 1,  # same value for both
        fill_opacity=path.get("fill_opacity", 1) or 1,  # opacity parameters
    )

shape.commit()
outpdf.save("drawings-page-0.pdf")

The resulting PDF, uncompressed:

%PDF-1.7
%¿÷¢þ
1 0 obj
<< /Pages 2 0 R /Type /Catalog >>
endobj
2 0 obj
<< /Count 1 /Kids [ 3 0 R ] /Type /Pages >>
endobj
3 0 obj
<< /Contents [ 4 0 R ] /MediaBox [ 0 0 916 964 ] /Parent 2 0 R /Resources 5 0 R /Rotate 0 /Type /Page >>
endobj
4 0 obj
<< /Length 218 >>
stream

q
300 864 m
400 864 l
300 764 l
400 864 m
300 764 l
400 764 l
h
0.917647 0.898039 0.823529 rg f
Q

q
100 864 m
200 864 l
100 764 l
100 864 l
200 864 m
100 764 l
200 764 l
200 864 l
h
0.917647 0.898039 0.823529 rg f
Q
endstream
endobj
5 0 obj
<< >>
endobj
xref
0 6
0000000000 65535 f 
0000000015 00000 n 
0000000064 00000 n 
0000000123 00000 n 
0000000243 00000 n 
0000000511 00000 n 
trailer << /Root 1 0 R /Size 6 /ID [<5031f1b55beb389b6db47e7066e40862><608618d6bef2634daabcad57fdd3db31>] >>
startxref
532
%%EOF

I think the problem here is that when a path is closed, a closing line is implicitly added to each subpath of the path. The Shape API seems to treat a subpath as a sequence of connected drawing operations, starting a new one every time an operation is disconnected from the previous one. I could not find this explained in the documentation, but maybe it is derived from the PDF spec.
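The subpath rule guessed at here can be sketched in a few lines. This is only an illustration of that reading, not PyMuPDF's implementation: `split_subpaths` is a made-up name, points are plain tuples rather than Point objects, and the coordinates are modelled on the first square with the closing lines left out.

```python
# Sketch only: group get_drawings()-style line items into subpaths, starting
# a new subpath whenever an item does not begin where the previous one ended.
def split_subpaths(items):
    subpaths = []
    last_end = None
    for item in items:
        start, end = item[1], item[2]   # valid for ('l', p1, p2) items only
        if start != last_end:           # disconnected: open a new subpath
            subpaths.append([])
        subpaths[-1].append(item)
        last_end = end
    return subpaths


# Two triangles without their closing lines (illustrative coordinates).
items = [
    ("l", (300, 100), (400, 100)),
    ("l", (400, 100), (300, 200)),
    ("l", (400, 100), (300, 200)),   # jumps back: the second triangle starts
    ("l", (300, 200), (400, 200)),
]

for sub in split_subpaths(items):
    print(len(sub), "items, from", sub[0][1], "to", sub[-1][2])
```

If closePath then adds one line per subpath back to its first point, each group of two items becomes a full triangle, which would explain why the replayed PDF still renders correctly.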

I also found that issue #1863 reported the same problem, but it was closed without a satisfactory response.

Rodrigodd, Feb 27 '24 18:02

I have a fix in my tree. It delivers the following result upon page.get_drawings():

[{'closePath': False,
  'color': None,
  'dashes': None,
  'even_odd': False,
  'fill': (0.9176470041275024, 0.8980389833450317, 0.8235290050506592),
  'fill_opacity': 1.0,
  'items': [('l', Point(300.0, 100.0), Point(400.0, 100.0)),
            ('l', Point(400.0, 100.0), Point(300.0, 200.0)),
            ('l', Point(300.0, 200.0), Point(300.0, 100.0)),
            ('l', Point(400.0, 100.0), Point(300.0, 200.0)),
            ('l', Point(300.0, 200.0), Point(400.0, 200.0)),
            ('l', Point(400.0, 200.0), Point(400.0, 100.0))],
  'layer': '',
  'lineCap': None,
  'lineJoin': None,
  'rect': Rect(300.0, 100.0, 400.0, 200.0),
  'seqno': 0,
  'stroke_opacity': None,
  'type': 'f',
  'width': None},
 {'closePath': False,
  'color': None,
  'dashes': None,
  'even_odd': False,
  'fill': (0.9176470041275024, 0.8980389833450317, 0.8235290050506592),
  'fill_opacity': 1.0,
  'items': [('l', Point(100.0, 100.0), Point(200.0, 100.0)),
            ('l', Point(200.0, 100.0), Point(100.0, 200.0)),
            ('l', Point(100.0, 200.0), Point(100.0, 100.0)),
            ('l', Point(200.0, 100.0), Point(100.0, 200.0)),
            ('l', Point(100.0, 200.0), Point(200.0, 200.0)),
            ('l', Point(200.0, 200.0), Point(200.0, 100.0))],
  'layer': '',
  'lineCap': None,
  'lineJoin': None,
  'rect': Rect(100.0, 100.0, 200.0, 200.0),
  'seqno': 1,
  'stroke_opacity': None,
  'type': 'f',
  'width': None}]

Because there is no way to split up a path just because of additional intervening "close_path" instructions, I am simulating this by inserting the closing line myself: these are items 3 and 6 in both of the cases above. As a consequence, the overall Python path dictionary has 'closePath': False, because the closing has already happened.
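A quick way to check the new behavior is to test whether every subpath in an items list ends where it started. The following is a minimal sketch under stated assumptions: `all_subpaths_closed` is a hypothetical helper, it only handles ('l', p1, p2) items, and points are plain tuples copied from the first drawing above.

```python
# Sketch: check that every subpath in an 'items' list is explicitly closed,
# i.e. its last line ends at the point where its first line starts.
def all_subpaths_closed(items):
    first_start = last_end = None
    for _, start, end in items:          # assumes only ('l', p1, p2) items
        if start != last_end:            # a new subpath begins here
            if first_start is not None and last_end != first_start:
                return False             # the previous subpath was left open
            first_start = start
        last_end = end
    return first_start is None or last_end == first_start


fixed_items = [
    ("l", (300, 100), (400, 100)),
    ("l", (400, 100), (300, 200)),
    ("l", (300, 200), (300, 100)),   # inserted closing line (item 3)
    ("l", (400, 100), (300, 200)),
    ("l", (300, 200), (400, 200)),
    ("l", (400, 200), (400, 100)),   # inserted closing line (item 6)
]
# The same list without the two inserted closing lines.
open_items = [fixed_items[0], fixed_items[1], fixed_items[3], fixed_items[4]]

print(all_subpaths_closed(fixed_items), all_subpaths_closed(open_items))
```

With the closing lines present as explicit items, code that replays the drawing can pass closePath through unchanged instead of forcing it to True.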

Please be aware that get_drawings() does not know anything about PDF and does not even access the page's appearance source. Instead, it works for all relevant document types supported by MuPDF. The MuPDF access method therefore delivers abstracted (PDF-independent) information and will, for example, dissolve a rectangle into 3 separate lines followed by a "close_path" command. Here is how MuPDF's own extractor interprets your example PDF.

<document filename="testtriangles.pdf">
<page number="1" mediabox="0 0 916 964">
<set_default_colorspaces gray="DeviceGray" rgb="DeviceRGB" cmyk="DeviceCMYK" oi="None"/>
<group bbox="0 0 916 964" isolated="1" knockout="0" blendmode="Normal" alpha="1">
    <clip_path winding="nonzero" transform="1 0 0 1 0 0">
        <moveto x="0" y="0"/>
        <lineto x="916" y="0"/>
        <lineto x="916" y="964"/>
        <lineto x="0" y="964"/>
        <closepath/>
    </clip_path>
        <fill_path winding="nonzero" colorspace="DeviceRGB" color=".917647 .898039 .823529" ri="1" bp="1" op="0" opm="0" transform="1 0 0 1 300 100">
            <moveto x="0" y="0"/>
            <lineto x="100" y="0"/>
            <lineto x="0" y="100"/>
            <closepath/>
            <moveto x="100" y="0"/>
            <lineto x="0" y="100"/>
            <lineto x="100" y="100"/>
            <closepath/>
        </fill_path>
        <fill_path winding="nonzero" colorspace="DeviceRGB" color=".917647 .898039 .823529" ri="1" bp="1" op="0" opm="0" transform="1 0 0 1 100 100">
            <moveto x="0" y="0"/>
            <lineto x="100" y="0"/>
            <lineto x="0" y="100"/>
            <lineto x="0" y="0"/>
            <closepath/>   <!-- ignored because we are already connected to the start point -->
            <moveto x="100" y="0"/>
            <lineto x="0" y="100"/>
            <lineto x="100" y="100"/>
            <lineto x="100" y="0"/>
            <closepath/>  <!-- ignored: same reason as the previous closepath in this path -->
        </fill_path>
    <pop_clip/>
</group>
</page>
</document>

JorjMcKie, Mar 01 '24 15:03

Fixed in 1.24.0.