Can documents in ppt format be parsed?

Open chuanbei888 opened this issue 1 year ago • 10 comments

chuanbei888 avatar Aug 02 '24 02:08 chuanbei888

https://github.com/opendatalab/magic-doc

It works with ppt/pptx files.

myhloli avatar Aug 02 '24 02:08 myhloli

If you want high-quality extraction results, convert the ppt to pdf first and then use MinerU. If you want fast extraction and do not care as much about quality, choose magic-doc.

drunkpig avatar Aug 02 '24 02:08 drunkpig

@chuanbei888 Try converting the ppt to pdf with LibreOffice.

drunkpig avatar Aug 02 '24 02:08 drunkpig

libreoffice --invisible --convert-to docx:'MS Word 2007 XML' /path/to/mydoc.doc --outdir /output/dir
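The command above illustrates the `--convert-to` syntax with a doc-to-docx conversion. For the ppt-to-pdf case discussed in this thread, a minimal sketch might look like the following. It assumes a `libreoffice` binary on the PATH (on some systems it is named `soffice`), uses LibreOffice's `--headless` option rather than `--invisible`, and the function names and paths are hypothetical placeholders, not part of MinerU or magic-doc.

```python
import subprocess
from pathlib import Path


def build_pdf_command(src, outdir, binary="libreoffice"):
    """Build a LibreOffice command line that converts one file to PDF.

    `binary` is an assumption: adjust it to `soffice` if that is the
    name of the executable on your system.
    """
    return [
        binary,
        "--headless",           # run without a GUI
        "--convert-to", "pdf",  # target format
        "--outdir", str(outdir),
        str(src),
    ]


def convert_to_pdf(src, outdir):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(build_pdf_command(src, outdir), check=True)
    # LibreOffice keeps the input file's stem and swaps the extension.
    return Path(outdir) / (Path(src).stem + ".pdf")
```

Calling `convert_to_pdf("/path/to/slides.pptx", "/output/dir")` would then leave a PDF in `/output/dir` that can be passed to MinerU.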

drunkpig avatar Aug 02 '24 02:08 drunkpig

Okay, I will give it a try.

chuanbei888 avatar Aug 02 '24 02:08 chuanbei888

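A question about converting ppt and docx to markdown: which approach gives better results, converting to pdf and then using magic-pdf, or using magic-doc directly?

If I convert to pdf first and then to md, could the recognition of some text end up worse than reading the text directly?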

thorory avatar Aug 02 '24 09:08 thorory

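magic-doc has strong text extraction and is faster, but its final output does not contain any images. Converting to pdf and then extracting with magic-pdf gives better image and layout results; the drawback is that it is slower.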

myhloli avatar Aug 02 '24 09:08 myhloli

Is there a tool for batch-converting docx to pdf?

zouhuigang avatar Aug 06 '24 02:08 zouhuigang

@zouhuigang LibreOffice
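As a rough illustration of batch conversion, LibreOffice accepts several input files in one invocation, so a batch job can be a single command. The sketch below assumes a `libreoffice` binary on the PATH; the helper names and directories are hypothetical.

```python
import subprocess
from pathlib import Path


def build_batch_command(files, outdir, suffix=".docx", binary="libreoffice"):
    """Build one LibreOffice command that converts every matching file to PDF.

    Returns an empty list when no file has the requested suffix.
    """
    selected = sorted(str(f) for f in files if Path(f).suffix.lower() == suffix)
    if not selected:
        return []
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", str(outdir), *selected]


def convert_folder(indir, outdir):
    """Convert all .docx files found directly inside `indir`."""
    cmd = build_batch_command(Path(indir).iterdir(), outdir)
    if cmd:  # skip the call entirely when there is nothing to convert
        subprocess.run(cmd, check=True)
```

For example, `convert_folder("/input/dir", "/output/dir")` would convert every docx file in `/input/dir` in one run.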

drunkpig avatar Aug 08 '24 10:08 drunkpig

Is there any tool you recommend for converting ppt to pdf?

Victor94-king avatar Oct 24 '24 15:10 Victor94-king

@myhloli I'd like to ask: does the project have an exe version? Will magic-doc convert pdf to Word?

yiyahei-eng avatar Dec 19 '24 01:12 yiyahei-eng

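Currently, the local deployment is the full client, which needs to run in a Python environment. We will soon release a lightweight client based on a cloud service, and that will have an exe version.

There are no plans for pdf2word.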

myhloli avatar Dec 19 '24 01:12 myhloli

@myhloli If I want to change the output from markdown to Word, which file in the project should I modify?

yiyahei-eng avatar Dec 19 '24 01:12 yiyahei-eng

Use pandoc to convert the markdown after it has been produced.
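A minimal sketch of that post-processing step, assuming `pandoc` is installed and on the PATH. Pandoc infers the docx output format from the file extension passed to `-o`. The function names and file paths are hypothetical.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(md_path, docx_path=None, binary="pandoc"):
    """Build a pandoc command that converts a markdown file to docx.

    When `docx_path` is omitted, the output sits next to the input
    with a .docx extension.
    """
    src = Path(md_path)
    dst = Path(docx_path) if docx_path else src.with_suffix(".docx")
    return [binary, str(src), "-o", str(dst)]


def markdown_to_docx(md_path, docx_path=None):
    """Run pandoc on the markdown produced by the extraction step."""
    subprocess.run(build_pandoc_command(md_path, docx_path), check=True)
```

Images referenced by relative paths in the markdown are resolved against the directory pandoc runs in, so it may be simplest to run the command from the output folder.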

myhloli avatar Dec 19 '24 01:12 myhloli

@myhloli OK, thanks.

yiyahei-eng avatar Dec 19 '24 01:12 yiyahei-eng