Superscript and subscript information in text is lost

Open yiyibooks opened this issue 1 year ago • 5 comments

Description of the bug

The author block of a paper usually contains many superscript numbers, as in the image below. [image]

The markdown text that MinerU produces after parsing is shown below; the superscript information is lost.

Aryo Pradipta Gema 1 Joshua Ong Jun Leang 1 Giwon $\mathbf{H o n g^{1}}$ Alessio Devoto 2 Alberto Carlo Maria MancinoRohit Saxena1Xuanli$\mathbf{H}\mathbf{e}^{4}$Yu Zhao1Xiaotang Du1Mohammad Reza Ghasemi Madani 5 Claire Barale 1 Robert McHardy 6 Joshua Harris 7 Jean Kaddour 4 Emile van Krieken 1 Pasquale Minervini 1

1 University of Edinburgh 2 Sapienza University of Rome 3 Polytechnic University of Bari 4 University College London 5 University of Trento 6 AssemblyAI 7 UK Health Security Agency {first.last, jong2, p.minervini}@ed.ac.uk [email protected] [email protected] [email protected] [email protected] {xuanli.he, jean.kaddour.20, robert.mchardy.20}@ucl.ac.uk

How to reproduce the bug

Example paper: https://arxiv.org/pdf/2406.04127. Essentially every paper has this problem.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda

yiyibooks, Jul 26 '24 10:07

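Superscripts and subscripts are removed deliberately, because they are treated as something that disrupts the flow of the surrounding text.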

myhloli, Jul 26 '24 10:07

@yiyibooks The superscript citations in the paper are removed deliberately, on the assumption that superscripts hurt readability.

drunkpig, Jul 26 '24 10:07

Thanks @myhloli @drunkpig!

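Could you add an option to preserve this text metadata, along with other information such as bold and italics? This information is very useful when the markdown is rendered for people to read.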

yiyibooks, Jul 26 '24 10:07

@yiyibooks We cannot extract information such as text color and bold formatting from scanned PDFs, but we can obtain this information from text-based PDFs. This work deviates somewhat from our current main focus, so we will not be supporting the development of this feature in the near future.

drunkpig, Jul 26 '24 11:07
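
For readers who want a stopgap in the meantime, here is a minimal sketch of what such an option could do for text-based PDFs. It is not MinerU code: the helper names `span_to_markdown` and `line_to_markdown` are hypothetical, and the sample spans are made up for illustration. It assumes spans shaped like the ones PyMuPDF's `page.get_text("dict")` returns, where each span has a `text` and an integer `flags` field (bit 0 marks superscript, bit 1 italic, bit 4 bold).

```python
# Hypothetical helpers, not part of MinerU. The span layout mirrors
# PyMuPDF's page.get_text("dict") output for text-based PDFs.
SUPERSCRIPT = 1 << 0  # PyMuPDF span flag bit 0
ITALIC = 1 << 1       # PyMuPDF span flag bit 1
BOLD = 1 << 4         # PyMuPDF span flag bit 4


def span_to_markdown(span: dict) -> str:
    """Render one span as markdown, keeping superscript, bold, and italic."""
    text = span["text"]
    flags = span.get("flags", 0)
    body = text.strip()
    if not body:
        # Whitespace-only spans carry no styling worth marking up.
        return text
    # Keep surrounding whitespace outside the markers so they render properly.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    if flags & ITALIC:
        body = f"*{body}*"
    if flags & BOLD:
        body = f"**{body}**"
    if flags & SUPERSCRIPT:
        body = f"<sup>{body}</sup>"
    return f"{lead}{body}{trail}"


def line_to_markdown(spans: list[dict]) -> str:
    """Join the rendered spans of one text line."""
    return "".join(span_to_markdown(s) for s in spans)


# Made-up sample data resembling the author line from the example paper.
spans = [
    {"text": "Giwon Hong", "flags": BOLD},
    {"text": "1", "flags": SUPERSCRIPT},
    {"text": " Alessio Devoto", "flags": BOLD},
    {"text": "2", "flags": SUPERSCRIPT},
]
print(line_to_markdown(spans))
# prints: **Giwon Hong**<sup>1</sup> **Alessio Devoto**<sup>2</sup>
```

As noted above, this only applies to text-based PDFs; scanned PDFs carry no span flags, so the styling would have to be inferred some other way.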

Got it. Looking forward to seeing this implemented soon ~

yiyibooks, Jul 27 '24 01:07