ragflow [Bug]: Text location parsed from PDF isn't matched with the location parsed by OCR

Is there an existing issue for the same bug?

[X] I have checked the existing issues.

Branch name

master

Commit ID

None

Other environment information

No response

Actual behavior

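The text coordinates extracted from the PDF do not agree with the text coordinates produced by OCR, so the PDF characters cannot be matched to the OCR boxes.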
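
To make the symptom concrete, here is a minimal sketch of how a character can fail to match any OCR box when the two coordinate sets use different scales. It is not ragflow's `Recognizer.find_overlapped`; the helpers `overlap_ratio` and `best_box`, the `scale` and `min_ratio` parameters, and all numbers are hypothetical. The report does not say what causes the mismatch, so the 3x scale difference is only an assumption for illustration. Only the key names (`x0`, `x1`, `top`, `bottom`) come from the code in this issue.

```python
def overlap_ratio(char, box):
    """Fraction of the character's area that lies inside the box."""
    w = min(char["x1"], box["x1"]) - max(char["x0"], box["x0"])
    h = min(char["bottom"], box["bottom"]) - max(char["top"], box["top"])
    if w <= 0 or h <= 0:
        return 0.0
    area = (char["x1"] - char["x0"]) * (char["bottom"] - char["top"])
    return (w * h) / area if area > 0 else 0.0


def best_box(char, boxes, scale=1.0, min_ratio=0.5):
    """Index of the best-overlapping box, or None if nothing overlaps enough.

    `scale` (hypothetical) rescales the character into the boxes' units.
    """
    scaled = {k: char[k] * scale for k in ("x0", "x1", "top", "bottom")}
    best, best_ratio = None, min_ratio
    for i, box in enumerate(boxes):
        r = overlap_ratio(scaled, box)
        if r >= best_ratio:
            best, best_ratio = i, r
    return best


# Made-up data: one OCR box in pixel units of a page rendered at 3x zoom,
# and one character in un-zoomed PDF units.
ocr_boxes = [{"x0": 300.0, "x1": 600.0, "top": 150.0, "bottom": 210.0}]
pdf_char = {"x0": 110.0, "x1": 120.0, "top": 52.0, "bottom": 68.0, "text": "A"}

unscaled_match = best_box(pdf_char, ocr_boxes)           # units differ
scaled_match = best_box(pdf_char, ocr_boxes, scale=3.0)  # units agree
print(unscaled_match, scaled_match)
```

Printing the character and the candidate boxes side by side in this way is one way to check whether the gap is a constant scale factor, an offset, or something else.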

Expected behavior

No response

Steps to reproduce

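The text coordinates extracted from the PDF do not agree with the text coordinates produced by OCR, so the PDF characters cannot be matched to the OCR boxes.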


        for c in Recognizer.sort_X_firstly(
                chars, self.mean_width[pagenum - 1] // 4):
            ii = Recognizer.find_overlapped(c, bxs)
            if ii == 24:
                print(c)
            if ii is None:
                self.lefted_chars.append(c)
                continue
            ch = c["bottom"] - c["top"]
            if c["text"] != " " and c["text"] != "":
                bxs[ii]["font_height"] = c["height"]
            bh = bxs[ii]["bottom"] - bxs[ii]["top"]
            if abs(ch - bh) / max(ch, bh) >= 0.7 and c["text"] != ' ':
                self.lefted_chars.append(c)
                continue
            if c["text"] == " " and bxs[ii]["text"]:
                if re.match(r"[0-9a-zA-Z,.?;:!%%]", bxs[ii]["text"][-1]):
                    bxs[ii]["text"] += " "
            else:
                #bxs[ii]["text"] += c["text"]
                box_result = get_char_item(c,ii)
                box_results.append(box_result)
        # First sort the data by "ii" (groupby only groups consecutive items)
        sorted_data = sorted(box_results, key=lambda x: x["ii"])
        # Group the characters by box index with groupby
        grouped_data = {k: list(v) for k, v in groupby(sorted_data, key=lambda x: x["ii"])}
        # Write the combined text of each group back to its box
        for ii, items in grouped_data.items():
            sort_item = Recognizer.sort_Y_firstly(
                items, self.mean_width[pagenum - 1] // 3)
            texts = [item["text"] for item in sort_item]
            combined_text = ''.join(texts)  # concatenate the texts without a separator
            bxs[ii]["text"] = combined_text
        for b in bxs:
            if not b["text"]:
                left, right, top, bott = b["x0"] * ZM, b["x1"] * \
                                         ZM, b["top"] * ZM, b["bottom"] * ZM
                b["text"] = self.ocr.recognize(np.array(img),
                                               np.array([[left, top], [right, top], [right, bott], [left, bott]],
                                                        dtype=np.float32))
            del b["txt"]
        bxs = [b for b in bxs if b["text"]]
        if self.mean_height[-1] == 0:
            self.mean_height[-1] = np.median([b["bottom"] - b["top"]
                                              for b in bxs])
        self.boxes.append(bxs)
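
For readers who want to try the grouping step in isolation, the following is a small self-contained sketch of the sort-then-`groupby` pattern used above. The sample characters and the helper `join_chars_per_box` are made up, and a plain `(top, x0)` sort stands in for `Recognizer.sort_Y_firstly`, whose tolerance handling is not reproduced here.

```python
from itertools import groupby

# Made-up characters that have already been assigned a box index "ii".
box_results = [
    {"ii": 1, "x0": 10.0, "top": 5.0, "text": "b"},
    {"ii": 0, "x0": 22.0, "top": 5.0, "text": "i"},
    {"ii": 0, "x0": 10.0, "top": 5.0, "text": "H"},
    {"ii": 1, "x0": 2.0, "top": 5.0, "text": "a"},
]


def join_chars_per_box(items):
    """Return {box index: text} with characters in reading order per box."""
    # groupby only merges consecutive items, so sort by the key first.
    by_box = sorted(items, key=lambda x: x["ii"])
    texts = {}
    for ii, group in groupby(by_box, key=lambda x: x["ii"]):
        # Simplified stand-in for sort_Y_firstly: top to bottom, then left to right.
        ordered = sorted(group, key=lambda c: (c["top"], c["x0"]))
        texts[ii] = "".join(c["text"] for c in ordered)
    return texts


print(join_chars_per_box(box_results))
```

A box that receives no characters never appears in the result, which corresponds to the boxes that fall through to `self.ocr.recognize` in the snippet above.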

Additional information

No response

Sep 27 '24 06:09 dhking

I didn't get it.

Sep 29 '24 07:09 KevinHuSh

Closed due to no response.

Nov 30 '24 15:11 yuzhichang