ragflow
[Bug]: Text location parsed from PDF doesn't match the location parsed by OCR
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch name
master
Commit ID
None
Other environment information
No response
Actual behavior
The text coordinates extracted from the PDF do not agree with the text coordinates extracted by OCR, so the two cannot be matched.
Expected behavior
No response
Steps to reproduce
The text coordinates extracted from the PDF do not agree with the text coordinates extracted by OCR, so the two cannot be matched.
for c in Recognizer.sort_X_firstly(
        chars, self.mean_width[pagenum - 1] // 4):
    ii = Recognizer.find_overlapped(c, bxs)
    # Debug output: inspect the characters assigned to box 24
    if ii == 24:
        print(c)
    if ii is None:
        self.lefted_chars.append(c)
        continue
    ch = c["bottom"] - c["top"]
    if c["text"] != " " and c["text"] != "":
        bxs[ii]["font_height"] = c["height"]
    bh = bxs[ii]["bottom"] - bxs[ii]["top"]
    if abs(ch - bh) / max(ch, bh) >= 0.7 and c["text"] != ' ':
        self.lefted_chars.append(c)
        continue
    if c["text"] == " " and bxs[ii]["text"]:
        if re.match(r"[0-9a-zA-Z,.?;:!%%]", bxs[ii]["text"][-1]):
            bxs[ii]["text"] += " "
    else:
        # bxs[ii]["text"] += c["text"]
        box_result = get_char_item(c, ii)
        box_results.append(box_result)
# First, sort the data by "ii"
sorted_data = sorted(box_results, key=lambda x: x["ii"])
# Group the sorted data with groupby
grouped_data = {k: list(v) for k, v in groupby(sorted_data, key=lambda x: x["ii"])}
# Write the merged text of each group back to its box
for ii, items in grouped_data.items():
    sort_item = Recognizer.sort_Y_firstly(
        items, self.mean_width[pagenum - 1] // 3)
    texts = [item["text"] for item in sort_item]
    combined_text = ''.join(texts)  # concatenate the texts with no separator
    bxs[ii]["text"] = combined_text
for b in bxs:
    if not b["text"]:
        left, right, top, bott = b["x0"] * ZM, b["x1"] * \
            ZM, b["top"] * ZM, b["bottom"] * ZM
        b["text"] = self.ocr.recognize(np.array(img),
                                       np.array([[left, top], [right, top], [right, bott], [left, bott]],
                                                dtype=np.float32))
    del b["txt"]
bxs = [b for b in bxs if b["text"]]
if self.mean_height[-1] == 0:
    self.mean_height[-1] = np.median([b["bottom"] - b["top"]
                                      for b in bxs])
self.boxes.append(bxs)
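To make the reported symptom easier to reproduce in isolation, here is a minimal standalone sketch of overlap-based matching between PDF characters and OCR boxes. It is not ragflow's implementation: `overlap_ratio`, `find_box`, `assign_chars`, and the `scale` and `threshold` parameters are hypothetical names introduced for illustration. The sketch assumes that one possible cause of the mismatch is that the two sources use different coordinate scales (for example, PDF points versus zoomed image pixels). That assumption is not confirmed anywhere in this report.

```python
from itertools import groupby


def overlap_ratio(char, box):
    """Share of the character's area that lies inside the box (0.0 to 1.0)."""
    w = min(char["x1"], box["x1"]) - max(char["x0"], box["x0"])
    h = min(char["bottom"], box["bottom"]) - max(char["top"], box["top"])
    if w <= 0 or h <= 0:
        return 0.0
    area = (char["x1"] - char["x0"]) * (char["bottom"] - char["top"])
    return (w * h) / area if area > 0 else 0.0


def find_box(char, boxes, threshold=0.5):
    """Index of the best-overlapping box, or None if none reaches threshold."""
    best, best_ratio = None, 0.0
    for i, box in enumerate(boxes):
        ratio = overlap_ratio(char, box)
        if ratio >= threshold and ratio > best_ratio:
            best, best_ratio = i, ratio
    return best


def assign_chars(chars, boxes, scale=1.0):
    """Fill each box's text from the characters that overlap it.

    `scale` (hypothetical) converts character coordinates into the
    coordinate space of the boxes. Returns the unmatched characters.
    """
    matched, leftover = [], []
    for c in chars:
        scaled = dict(c)
        for key in ("x0", "x1", "top", "bottom"):
            scaled[key] = c[key] * scale
        i = find_box(scaled, boxes)
        if i is None:
            leftover.append(c)
        else:
            matched.append((i, scaled))
    # groupby only merges adjacent items, so sort by box index first,
    # then left to right within each box.
    matched.sort(key=lambda m: (m[0], m[1]["x0"]))
    for i, group in groupby(matched, key=lambda m: m[0]):
        boxes[i]["text"] = "".join(ch["text"] for _, ch in group)
    return leftover


# OCR box in image pixels (rendered at 3x), characters in PDF points.
boxes = [{"x0": 300, "x1": 600, "top": 300, "bottom": 360, "text": ""}]
chars = [
    {"x0": 110, "x1": 116, "top": 103, "bottom": 115, "text": "i"},
    {"x0": 102, "x1": 108, "top": 103, "bottom": 115, "text": "H"},
]

unmatched = assign_chars(chars, boxes)             # scales differ: nothing overlaps
rescaled = assign_chars(chars, boxes, scale=3.0)   # same scale: both chars match
```

With mismatched scales every character falls outside the box and ends up in the leftover list, which is the same path as `ii is None` in the snippet above. Once the characters are rescaled, both land in the box and its text is filled in. Printing the raw coordinates of one character and its expected box, as the `if ii == 24` debug line does, is a quick way to check whether a constant factor or offset separates the two.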
Additional information
No response
I didn't get it.
Close due to no response.