PyMuPDF
`get_drawings`'s `items` is missing line from `h` path operator
Description of the bug
I have a PDF file with the following two squares, each one formed by two triangles:
q 1 0 0 1 300 100 cm
0 0 m
100 0 l
0 100 l
h
100 0 m
0 100 l
100 100 l
h
f
Q
q 1 0 0 1 100 100 cm
0 0 m
100 0 l
0 100 l
0 0 l
h
100 0 m
0 100 l
100 100 l
100 0 l
h
f
Q
The only difference in the second shape is the presence of an extra `l` operator before the `h` operator, making the `h` unnecessary. The `h` operator is described in section 8.5.2.1 of the PDF specification (ISO 32000-1:2008), and it should close the current subpath by appending a line.
When rendered, the two shapes are equal:
But when running `Page.get_drawings` on this document, the `items` of the first drawing are missing the line drawn by each `h` operator:
{'closePath': True,
'color': None,
'dashes': None,
'even_odd': False,
'fill': (0.9176470041275024, 0.8980389833450317, 0.8235290050506592),
'fill_opacity': 1.0,
'items': [('l', Point(400.0, 200.0), Point(300.0, 200.0)),
('l', Point(300.0, 200.0), Point(400.0, 100.0)),
('l', Point(300.0, 200.0), Point(400.0, 100.0)),
('l', Point(400.0, 100.0), Point(300.0, 100.0))],
'layer': '',
'lineCap': None,
'lineJoin': None,
'rect': Rect(300.0, 100.0, 400.0, 200.0),
'seqno': 0,
'stroke_opacity': None,
'type': 'f',
'width': None}
{'closePath': True,
'color': None,
'dashes': None,
'even_odd': False,
'fill': (0.9176470041275024, 0.8980389833450317, 0.8235290050506592),
'fill_opacity': 1.0,
'items': [('l', Point(200.0, 200.0), Point(100.0, 200.0)),
('l', Point(100.0, 200.0), Point(200.0, 100.0)),
('l', Point(200.0, 100.0), Point(200.0, 200.0)),
('l', Point(100.0, 200.0), Point(200.0, 100.0)),
('l', Point(200.0, 100.0), Point(100.0, 100.0)),
('l', Point(100.0, 100.0), Point(100.0, 200.0))],
'layer': '',
'lineCap': None,
'lineJoin': None,
'rect': Rect(100.0, 100.0, 200.0, 200.0),
'seqno': 1,
'stroke_opacity': None,
'type': 'f',
'width': None}
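As an illustration of the expected semantics, the effect of `m`, `l`, and `h` on the list of line segments can be sketched in plain Python. This is not PyMuPDF or MuPDF code: `path_segments` and the tuple encoding of the operators are hypothetical names made up for this example, coordinates are left untransformed, and a closing line of zero length is simply dropped.

```python
def path_segments(ops):
    """Expand (operator, *operands) tuples into explicit line segments.

    Only 'm', 'l', and 'h' are handled; this is an illustration,
    not a PDF parser.
    """
    segments = []
    start = current = None
    for op, *args in ops:
        if op == "m":        # begin a new subpath at the given point
            start = current = tuple(args)
        elif op == "l":      # straight line from the current point
            point = tuple(args)
            segments.append((current, point))
            current = point
        elif op == "h":      # close the subpath with a line back to its start
            if current is not None and current != start:
                segments.append((current, start))
            current = start
        else:
            raise ValueError(f"unhandled operator: {op}")
    return segments


# First shape: each triangle relies on 'h' for its third edge.
implicit = [("m", 0, 0), ("l", 100, 0), ("l", 0, 100), ("h",),
            ("m", 100, 0), ("l", 0, 100), ("l", 100, 100), ("h",)]
# Second shape: the third edge is drawn explicitly, so 'h' adds nothing.
explicit = [("m", 0, 0), ("l", 100, 0), ("l", 0, 100), ("l", 0, 0), ("h",),
            ("m", 100, 0), ("l", 0, 100), ("l", 100, 100), ("l", 100, 0), ("h",)]

assert path_segments(implicit) == path_segments(explicit)
print(len(path_segments(implicit)))  # 6: three edges per triangle
```

Under these assumptions both shapes expand to the same six segments, which is the result one would expect in `items` for both drawings.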
How to reproduce the bug
Run the following script:
import fitz
import sys
from pprint import pprint
pdf_path = sys.argv[1]
pdf_document = fitz.open(pdf_path)
for page in pdf_document:
    for draw in page.get_drawings():
        pprint(draw)
with the following sample file:
I created the sample file by manually editing another file and fixing the stream length with `qpdf`, so don't be surprised if there is anything wrong with it.
PyMuPDF version
1.23.25
Operating system
Windows
Python version
3.12
I looked a little more into this, and I think the problem is at least a lack of documentation.
I ran the code from the Drawing and Graphics guide, and it still produced the expected PDF:
import fitz
from pprint import pprint
doc = fitz.open("testtriangles.pdf")
page = doc[0]
paths = page.get_drawings() # extract existing drawings
outpdf = fitz.open()
outpage = outpdf.new_page(width=page.rect.width, height=page.rect.height)
shape = outpage.new_shape() # make a drawing canvas for the output page
for path in paths:
    pprint(path)
    for item in path["items"]:  # these are the draw commands
        if item[0] == "l":  # line
            shape.draw_line(item[1], item[2])
        elif item[0] == "re":  # rectangle
            shape.draw_rect(item[1])
        elif item[0] == "qu":  # quad
            shape.draw_quad(item[1])
        elif item[0] == "c":  # curve
            shape.draw_bezier(item[1], item[2], item[3], item[4])
        else:
            raise ValueError("unhandled drawing", item)
    shape.finish(
        fill=path["fill"],  # fill color
        color=path["color"],  # line color
        dashes=path["dashes"],  # line dashing
        even_odd=path.get("even_odd", True),  # control color of overlaps
        closePath=path["closePath"] or True,  # whether to connect last and first point
        lineJoin=path["lineJoin"] or 0,  # how line joins should look like
        lineCap=path["lineCap"] or 0,  # how line ends should look like
        width=path["width"],  # line width
        stroke_opacity=path.get("stroke_opacity", 1) or 1,  # same value for both
        fill_opacity=path.get("fill_opacity", 1) or 1,  # opacity parameters
    )
shape.commit()
outpdf.save("drawings-page-0.pdf")
The resulting PDF, uncompressed:
%PDF-1.7
%¿÷¢þ
1 0 obj
<< /Pages 2 0 R /Type /Catalog >>
endobj
2 0 obj
<< /Count 1 /Kids [ 3 0 R ] /Type /Pages >>
endobj
3 0 obj
<< /Contents [ 4 0 R ] /MediaBox [ 0 0 916 964 ] /Parent 2 0 R /Resources 5 0 R /Rotate 0 /Type /Page >>
endobj
4 0 obj
<< /Length 218 >>
stream
q
300 864 m
400 864 l
300 764 l
400 864 m
300 764 l
400 764 l
h
0.917647 0.898039 0.823529 rg f
Q
q
100 864 m
200 864 l
100 764 l
100 864 l
200 864 m
100 764 l
200 764 l
200 864 l
h
0.917647 0.898039 0.823529 rg f
Q
endstream
endobj
5 0 obj
<< >>
endobj
xref
0 6
0000000000 65535 f
0000000015 00000 n
0000000064 00000 n
0000000123 00000 n
0000000243 00000 n
0000000511 00000 n
trailer << /Root 1 0 R /Size 6 /ID [<5031f1b55beb389b6db47e7066e40862><608618d6bef2634daabcad57fdd3db31>] >>
startxref
532
%%EOF
I think the problem here is that if the path is closed, a closing line will be implicitly added to each subpath of the path. And the Shape API defines a subpath as a sequence of connected drawing operations, creating a new one every time an operation is disconnected from the previous. I could not find this explained in the documentation, but maybe this is derived from the PDF spec.
I also found that issue #1863 reported the same problem, but it was closed without a satisfactory response.
I have a fix in my tree. It delivers the following result for `page.get_drawings()`:
[{'closePath': False,
'color': None,
'dashes': None,
'even_odd': False,
'fill': (0.9176470041275024, 0.8980389833450317, 0.8235290050506592),
'fill_opacity': 1.0,
'items': [('l', Point(300.0, 100.0), Point(400.0, 100.0)),
('l', Point(400.0, 100.0), Point(300.0, 200.0)),
('l', Point(300.0, 200.0), Point(300.0, 100.0)),
('l', Point(400.0, 100.0), Point(300.0, 200.0)),
('l', Point(300.0, 200.0), Point(400.0, 200.0)),
('l', Point(400.0, 200.0), Point(400.0, 100.0))],
'layer': '',
'lineCap': None,
'lineJoin': None,
'rect': Rect(300.0, 100.0, 400.0, 200.0),
'seqno': 0,
'stroke_opacity': None,
'type': 'f',
'width': None},
{'closePath': False,
'color': None,
'dashes': None,
'even_odd': False,
'fill': (0.9176470041275024, 0.8980389833450317, 0.8235290050506592),
'fill_opacity': 1.0,
'items': [('l', Point(100.0, 100.0), Point(200.0, 100.0)),
('l', Point(200.0, 100.0), Point(100.0, 200.0)),
('l', Point(100.0, 200.0), Point(100.0, 100.0)),
('l', Point(200.0, 100.0), Point(100.0, 200.0)),
('l', Point(100.0, 200.0), Point(200.0, 200.0)),
('l', Point(200.0, 200.0), Point(200.0, 100.0))],
'layer': '',
'lineCap': None,
'lineJoin': None,
'rect': Rect(100.0, 100.0, 200.0, 200.0),
'seqno': 1,
'stroke_opacity': None,
'type': 'f',
'width': None}]
Because there is no way to split up a path just because of additional intervening "close_path" instructions, I am simulating this by inserting the additional line myself: items 3 and 6 in each of the two paths above.
As a consequence, the overall Python path will have the key `'closePath': False`, because the closing has already happened.
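The effect of inserting the closing lines can be sketched as follows. This is not the actual fix, only a minimal illustration under stated assumptions: the hypothetical helper `close_subpaths` handles `'l'` items only, uses plain `(x, y)` tuples instead of `Point` objects, and treats an item that does not start at the previous item's end point as the beginning of a new subpath.

```python
def close_subpaths(items):
    """Return a copy of `items` with an explicit closing line per subpath.

    Assumes every item is ('l', start, end) with points as (x, y) tuples.
    A new subpath begins wherever an item does not start at the previous
    item's end point.
    """
    result = []
    first = last = None  # start point and current end point of the open subpath

    def close():
        # Append the closing line unless the subpath already ends at its start.
        if first is not None and last != first:
            result.append(("l", last, first))

    for kind, start, end in items:
        if kind != "l":
            raise ValueError(f"unhandled item type: {kind}")
        if start != last:  # disconnected: finish the old subpath, open a new one
            close()
            first = start
        result.append((kind, start, end))
        last = end
    close()  # finish the final subpath
    return result


# The four lines drawn explicitly by the first shape, in page coordinates.
open_items = [
    ("l", (300.0, 100.0), (400.0, 100.0)),
    ("l", (400.0, 100.0), (300.0, 200.0)),
    ("l", (400.0, 100.0), (300.0, 200.0)),
    ("l", (300.0, 200.0), (400.0, 200.0)),
]
closed_items = close_subpaths(open_items)
assert len(closed_items) == 6
assert closed_items[2] == ("l", (300.0, 200.0), (300.0, 100.0))
assert closed_items[5] == ("l", (400.0, 200.0), (400.0, 100.0))
```

For the first shape this yields the same six items as listed above, with the added lines in positions 3 and 6. One limitation of the connectivity test: a new subpath that happens to start exactly where the previous one ended would not be detected.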
Please be aware that the method `get_drawings()` does not know anything about PDF, nor does it access the page appearance source. Instead, it works for all relevant document types supported by MuPDF. The MuPDF access method therefore delivers abstracted (PDF-independent) information and will, for example, dissolve rectangles into 3 separate lines together with a "close_path" command. Here is how MuPDF's own extractor interprets your example PDF:
<document filename="testtriangles.pdf">
<page number="1" mediabox="0 0 916 964">
<set_default_colorspaces gray="DeviceGray" rgb="DeviceRGB" cmyk="DeviceCMYK" oi="None"/>
<group bbox="0 0 916 964" isolated="1" knockout="0" blendmode="Normal" alpha="1">
<clip_path winding="nonzero" transform="1 0 0 1 0 0">
<moveto x="0" y="0"/>
<lineto x="916" y="0"/>
<lineto x="916" y="964"/>
<lineto x="0" y="964"/>
<closepath/>
</clip_path>
<fill_path winding="nonzero" colorspace="DeviceRGB" color=".917647 .898039 .823529" ri="1" bp="1" op="0" opm="0" transform="1 0 0 1 300 100">
<moveto x="0" y="0"/>
<lineto x="100" y="0"/>
<lineto x="0" y="100"/>
<closepath/>
<moveto x="100" y="0"/>
<lineto x="0" y="100"/>
<lineto x="100" y="100"/>
<closepath/>
</fill_path>
<fill_path winding="nonzero" colorspace="DeviceRGB" color=".917647 .898039 .823529" ri="1" bp="1" op="0" opm="0" transform="1 0 0 1 100 100">
<moveto x="0" y="0"/>
<lineto x="100" y="0"/>
<lineto x="0" y="100"/>
<lineto x="0" y="0"/>
<closepath/> // ignored because we already are connected to start point
<moveto x="100" y="0"/>
<lineto x="0" y="100"/>
<lineto x="100" y="100"/>
<lineto x="100" y="0"/>
<closepath/> // ignored - see above
</fill_path>
<pop_clip/>
</group>
</page>
</document>
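The `clip_path` at the top of the trace shows the rectangle decomposition mentioned above: three lines plus a closing command. A minimal sketch of that decomposition, where `rect_to_path` and the command names are hypothetical and simply mirror the element names in the trace:

```python
def rect_to_path(x0, y0, x1, y1):
    """Decompose an axis-aligned rectangle into generic path commands."""
    return [
        ("moveto", x0, y0),
        ("lineto", x1, y0),
        ("lineto", x1, y1),
        ("lineto", x0, y1),
        ("closepath",),  # the fourth edge is implied by closing the path
    ]


# The page-sized clip rectangle from the trace (mediabox 0 0 916 964).
page_clip = rect_to_path(0, 0, 916, 964)
assert page_clip[-1] == ("closepath",)
```

A consumer that wants four explicit edges therefore has to turn the final "closepath" into a line itself, which is the same step the fix performs for the triangles.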
Fixed in 1.24.0.