pdf2docx Analyzing document... cost 6 min due to lots of duplicated images

Conversion log:

[INFO] Start to convert g://pdf/5CE48DAAB7DB616A.pdf
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[INFO] [3/4] Parsing pages...
[INFO] (1/29) Page 1
[INFO] (2/29) Page 2
[INFO] (3/29) Page 3
[INFO] (4/29) Page 4
[INFO] (5/29) Page 5
[INFO] (6/29) Page 6
[INFO] (7/29) Page 7
[INFO] (8/29) Page 8
[INFO] (9/29) Page 9
[INFO] (10/29) Page 10
[INFO] (11/29) Page 11
[INFO] (12/29) Page 12
[INFO] (13/29) Page 13
[INFO] (14/29) Page 14
[INFO] (15/29) Page 15
[INFO] (16/29) Page 16
[INFO] (17/29) Page 17
[INFO] (18/29) Page 18
[INFO] (19/29) Page 19
[INFO] (20/29) Page 20
[INFO] (21/29) Page 21
[INFO] (22/29) Page 22
[INFO] (23/29) Page 23
[INFO] (24/29) Page 24
[INFO] (25/29) Page 25
[INFO] (26/29) Page 26
[INFO] (27/29) Page 27
[INFO] (28/29) Page 28
[INFO] (29/29) Page 29
[INFO] [4/4] Creating pages...
[INFO] (1/29) Page 1
[INFO] (2/29) Page 2
[INFO] (3/29) Page 3
[INFO] (4/29) Page 4
[INFO] (5/29) Page 5
[INFO] (6/29) Page 6
[INFO] (7/29) Page 7
[INFO] (8/29) Page 8
[INFO] (9/29) Page 9
[INFO] (10/29) Page 10
[INFO] (11/29) Page 11
[INFO] (12/29) Page 12
[INFO] (13/29) Page 13
[INFO] (14/29) Page 14
[INFO] (15/29) Page 15
[INFO] (16/29) Page 16
[INFO] (17/29) Page 17
[INFO] (18/29) Page 18
[INFO] (19/29) Page 19
[INFO] (20/29) Page 20
[INFO] (21/29) Page 21
[INFO] (22/29) Page 22
[INFO] (23/29) Page 23
[INFO] (24/29) Page 24
[INFO] (25/29) Page 25
[INFO] (26/29) Page 26
[INFO] (27/29) Page 27
[INFO] (28/29) Page 28
[INFO] (29/29) Page 29
[INFO] Terminated in 358.64s.

Apr 26 '22 03:04 bikerr

The problem is probably in common/Collection.py. I haven't looked into how to fix it.

Apr 27 '22 09:04 ZHangZHengEric

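The "Analyzing document..." step extracts the images in the PDF and checks whether they are adjacent or touching, so that the small images a page was split into during scanning (this depends on the scanner) can be stitched back into the single large image the reader actually sees. For example, each rectangle in the figure below is one of these small split images.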
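
To make the stitching idea concrete, here is a minimal sketch of grouping touching rectangles with a pairwise check. It is not pdf2docx's actual implementation: rectangles are plain `(x0, y0, x1, y1)` tuples, and `touches`, `group_touching`, and `union_bbox` are hypothetical helper names introduced only for illustration.

```python
from itertools import combinations


def touches(a, b, tol=0.0):
    """Return True if rectangles a and b overlap or share an edge or corner."""
    return (a[0] <= b[2] + tol and b[0] <= a[2] + tol and
            a[1] <= b[3] + tol and b[1] <= a[3] + tol)


def group_touching(rects):
    """Group rectangles into connected clusters using a pairwise O(n^2) check."""
    parent = list(range(len(rects)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair is compared once, which is where the O(n^2) cost comes from.
    for i, j in combinations(range(len(rects)), 2):
        if touches(rects[i], rects[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, rect in enumerate(rects):
        groups.setdefault(find(i), []).append(rect)
    return list(groups.values())


def union_bbox(group):
    """Bounding box of a cluster, i.e. the area of the stitched image."""
    return (min(r[0] for r in group), min(r[1] for r in group),
            max(r[2] for r in group), max(r[3] for r in group))


# Three tiles forming an L-shape plus one isolated image.
tiles = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 10, 20), (50, 50, 60, 60)]
bboxes = sorted(union_bbox(g) for g in group_touching(tiles))
print(bboxes)
```

The pairwise loop is cheap for a few hundred fragments, but its cost grows quadratically with the number of rectangles passed in.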

Jun 14 '22 08:06 dothinking

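The direct cause of the long run time, as @ZHangZHengEric said, is that this step uses an O(n^2) algorithm to check every pair of images for intersection. Judging by eye, though, the number of image fragments is well below 1000, and in testing that is not enough to cause a noticeable wait. Digging further, something goes wrong when the upstream library PyMuPDF extracts the images, so a large number of duplicate images end up being considered. To work around this, filter out the duplicates by adding the three marked lines in the code below.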

# line ~100 from ImagesExtractor.py

ic = Collection()
unique_rects = set()  # added line 1: set used for deduplication
for item in self._page.get_images(full=True):
    # image item: (xref, smask, width, height, bpc, colorspace, ...)
    item = list(item)
    item[-1] = 0            
    
    # find all occurrences referenced to this image            
    rects = self._page.get_image_rects(item)  # this call returns many duplicate positions
    unrotated_page_bbox = self._page.cropbox # note the difference to page.rect
    for bbox in rects:
        if bbox in unique_rects: continue  # added line 2: skip duplicate positions

        # ignore images outside page
        if not unrotated_page_bbox.intersects(bbox): continue

        # collect images
        unique_rects.add(bbox)  # added line 3: record this position
        ic.append((bbox, item))
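
As a rough illustration of why filtering duplicates matters so much under a pairwise check, the sketch below counts the comparisons before and after deduplication. The numbers are made up for the example (200 fragments, each reported 30 times) and are not measurements from this PDF. `dedupe` and `pair_count` are hypothetical helpers, and rectangles are plain tuples rather than PyMuPDF `Rect` objects.

```python
def pair_count(n):
    """Number of unordered pairs an O(n^2) pairwise check has to visit."""
    return n * (n - 1) // 2


def dedupe(rects):
    """Keep the first occurrence of each rectangle, preserving order."""
    seen = set()
    unique = []
    for rect in rects:
        if rect in seen:
            continue
        seen.add(rect)
        unique.append(rect)
    return unique


# Assumed scenario: 200 distinct strips, each reported 30 times.
fragments = [(0, 10 * i, 100, 10 * (i + 1)) for i in range(200)]
reported = fragments * 30
unique = dedupe(reported)

print(len(reported), pair_count(len(reported)))
print(len(unique), pair_count(len(unique)))
```

Because the work scales with the square of the input size, removing a 30-fold duplication cuts the number of comparisons by roughly a factor of 900 in this made-up case.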

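The total time drops from over 350 seconds to 7~8 seconds. However, the conversion then emits a large number of warnings, and the resulting docx is missing images.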

[WARNING] Ignore image due to inconsistent size of color and mask pixmaps: [1092, 1093, 2, 2, 1, 'Indexed', '', 'Image1092', '', 0]

The only option left is to file an issue with the upstream library and wait for a fix.

https://github.com/pymupdf/PyMuPDF/discussions/1752

Jun 14 '22 08:06 dothinking
