
Fix layout inversion bug

Open ic-xu opened this issue 1 year ago • 2 comments

description:

This fixes a bug where, when parsing a PDF that was converted from a PPT file, the content layout comes out in reverse order. As shown in the diagram below, if sorting is based on the lower-right corner of the bbox, rectangle A is ranked after rectangle B. But if the rectangles contain text, the text in rectangle A should be read before the text in rectangle B. I think using the upper-left corner of the rectangle as the basis for bbox sorting would better match most people's reading habits.

                ^
            Y  |
               |
               |
               |
               |     +----------------------------------------------+(x1,y1)
               |     |                                              |
               |     |   A                                          |
               |     |                                         (x1,y1)
               |     |            +----------------------------+    |
               |     |            |                            |    |
               |     |            |     B                      |    |
               |     |            |                            |    |
               |     |            |                            |    |
               |     |            +----------------------------+    |
               |     |            (x0,y0)                           |
               |     +----------------------------------------------+
               |    (x0,y0)
       +------------------------------------------------------------------------------------------------>
               +                                                                                       X
                                           +
                                           |
                                           |
                                           |
                                           |
                                           |
                                           |
                                           v
          ^
      Y   |          (x0,y0)
          |          +------------------------------------------------+
          |          |                                                |
          |          |                                                |
          |          |   A           (x0,y0)                          |
          |          |               +--------------------------+     |
          |          |               |                          |     |
          |          |               |   B                      |     |
          |          |               |                          |     |
          |          |               |                          |     |
          |          |               +--------------------------+     |
          |          |                                          (x1,y1)
          |          +------------------------------------------------+
          |                                                           (x1,y1)
          |
+--------------------------------------------------------------------------------->
          |                                                                     X
          |
          +

So I think that when switching coordinate systems, (x0, y0) should remain the upper-left corner of the rectangle.
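To make the ordering argument concrete, here is a minimal sketch, not the actual open-parse code. The `Box` class and both sort functions are hypothetical names invented for illustration, and the coordinates are made up to match the first diagram (bottom-left origin, y growing upward, A enclosing B).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bbox in a bottom-left-origin system (y grows upward)."""
    name: str
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def order_by_upper_left(boxes):
    # The upper-left corner is (x0, y1) in this system:
    # a higher top edge reads first, ties broken left to right.
    return sorted(boxes, key=lambda b: (-b.y1, b.x0))


def order_by_bottom_edge(boxes):
    # Sorting on the bottom edge instead: a higher bottom edge reads first.
    return sorted(boxes, key=lambda b: (-b.y0, b.x0))


a = Box("A", 10, 10, 110, 60)  # outer rectangle
b = Box("B", 30, 20, 90, 50)   # rectangle nested inside A

print([box.name for box in order_by_upper_left([b, a])])   # ['A', 'B']
print([box.name for box in order_by_bottom_edge([b, a])])  # ['B', 'A']
```

With these made-up coordinates, sorting on the upper-left corner puts A before B, while sorting on the bottom edge puts B first, which is the inversion described above.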

ic-xu avatar Apr 25 '24 03:04 ic-xu

PyMuPDF uses a top-left coordinate system while the rest of our code uses bottom-left. As a result, we need to swap these for everything to work. Do you have an example PDF?
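For readers unfamiliar with the swap being described, a rough sketch of converting a bbox between a top-left-origin and a bottom-left-origin system might look like the following. The function name `flip_bbox_y` and the `page_height` parameter are assumptions for illustration; this is not taken from the open-parse source.

```python
def flip_bbox_y(bbox, page_height):
    """Convert (x0, y0, x1, y1) between top-left and bottom-left origins.

    Hypothetical helper: assumes y0 <= y1 on input and keeps y0 <= y1
    on output, so the two y values trade roles after the flip.
    """
    x0, y0, x1, y1 = bbox
    # Mirroring y reverses the order of the two edges, so re-pair them.
    new_y0 = page_height - y1
    new_y1 = page_height - y0
    return (x0, new_y0, x1, new_y1)


# A box 20 units below the top of a 100-unit-tall page (top-left origin)...
top_left_bbox = (10, 20, 60, 50)
# ...spans y = 50..80 when measured from the bottom of the page.
print(flip_bbox_y(top_left_bbox, 100))  # (10, 50, 60, 80)
```

Note that after the flip, `y0` is the bottom edge rather than the top edge. Any sort that relies on `(x0, y0)` being the upper-left corner therefore has to use `y1` for the top edge in the bottom-left system.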

Filimoa avatar Apr 25 '24 03:04 Filimoa

Hi, nice to receive your reply.

I found a Chinese PDF document on the Internet. You only need to look at the title on the first page and the order of the email addresses below it, so only the parsing results for the first page matter. Don't worry about the rest.

I have attached the PDF document. I hope you have a great day.

------------------ Original message ------------------ From: "Filimoa/open-parse"; Sent: Thursday, April 25, 2024, 11:38 AM; Subject: Re: [Filimoa/open-parse] Fix layout inversion bug (PR #33)


Large attachment sent from QQ Mail

test_layout.pdf (6.36 MB, no expiry). Download page: https://mail.qq.com/cgi-bin/ftnExs_download?k=7c393535f130ff9cde0d327b1761574c404d075401005302150d0650534c01000d0018540957524e5b5b5350520000535a5f0157316e65175d4a416a5d001c0c4d4d1b455507655e&t=exs_ftn_download&code=89551aec

ic-xu avatar Apr 25 '24 04:04 ic-xu