drunkpig comments

Results: 91 comments of drunkpig

Converted to Markdown using a PDF extraction tool

@TapXWorld Models are not quite like people. The group owner may want to look at this well-known LLM training dataset: https://huggingface.co/datasets/tiiuae/falcon-refinedweb/viewer/default/train?row=0

Recognition of program code blocks is missing

@yiyibooks Thanks for your enthusiasm. As you can see, code blocks, lists, and content lists are not yet recognized by the layout recognition model. The development of this feature is...

Provide an explanation of each parameter in the output results

@WillingLau @Tendo33 We'll publish a new release, 0.7.0, and the documentation will be ready by then.

Provide an explanation of each parameter in the output results

> @WillingLau @Tendo33 We'll publish a new release, 0.7.0, and the documentation will be ready by then. Next week, around the first week of August.

Provide an explanation of each parameter in the output results

@WillingLau @Tendo33 Please refer to https://github.com/opendatalab/MinerU/blob/master/README_zh-CN_v2.md; we'll release it later.

Can documents in PPT format be parsed?

> https://github.com/opendatalab/magic-doc > > It works on ppt/pptx files. If you want a high-quality extraction result, you should convert the ppt to pdf, then use MinerU. If you want fast...

Can documents in PPT format be parsed?

@chuanbei888 Try converting the ppt to pdf with LibreOffice.

Can documents in PPT format be parsed?

`libreoffice --invisible --convert-to docx:'MS Word 2007 XML' /path/to/mydoc.doc --outdir /output/dir`
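The command in this comment converts a .doc file to .docx. For the ppt-to-pdf route suggested elsewhere in this thread, a minimal Python sketch might look like the following. It assumes a `libreoffice` binary on the PATH (on some systems it is named `soffice`) and uses the `--headless`, `--convert-to`, and `--outdir` options; the helper names `build_convert_command` and `convert_to_pdf` are hypothetical and not part of MinerU or magic-doc.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, binary="libreoffice"):
    """Return the argument list for a headless LibreOffice PDF conversion."""
    return [
        binary,
        "--headless",            # run without a GUI
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_to_pdf(src, out_dir, binary="libreoffice"):
    """Convert src (e.g. a .ppt or .pptx file) to PDF.

    Returns the path where LibreOffice is expected to write the PDF.
    The binary name is an assumption; pass binary="soffice" if needed.
    """
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    subprocess.run(build_convert_command(src, out_dir, binary), check=True)
    return Path(out_dir) / (Path(src).stem + ".pdf")
```

Calling `convert_to_pdf("/path/to/slides.pptx", "/output/dir")` should leave a PDF in the output directory, which can then be passed to MinerU.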

Can documents in PPT format be parsed?

@zouhuigang LibreOffice

Multi-level headings

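@jefferyvvv Due to limited manpower, heading levels have not been implemented yet. The approach would be as follows: once headings are recognized, the height of each heading's bbox is known, so you can group the headings by height and then sort the groups.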
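The idea of grouping by bbox height could be sketched roughly as below. This is only an illustration under stated assumptions, not MinerU's implementation: each heading is assumed to be a dict with a `text` field and a `bbox` of `(x0, y0, x1, y1)`, and the function name `assign_heading_levels` and the `tolerance` parameter are hypothetical. Heights are sorted from tallest to shortest, and a new level starts whenever the gap to the previous height exceeds the tolerance.

```python
def assign_heading_levels(headings, tolerance=1.0):
    """Assign a level to each heading based on its bbox height.

    headings: list of dicts with 'text' and 'bbox' = (x0, y0, x1, y1)
              (an assumed format, for illustration only).
    Returns a list of (level, text) tuples in the input order,
    where level 1 is the group with the tallest bboxes.
    """
    heights = [h["bbox"][3] - h["bbox"][1] for h in headings]

    # Walk distinct heights from tallest to shortest; a gap larger
    # than the tolerance opens a new (deeper) heading level.
    level_of = {}
    level = 0
    prev = None
    for height in sorted(set(heights), reverse=True):
        if prev is None or prev - height > tolerance:
            level += 1
        level_of[height] = level
        prev = height

    return [(level_of[ht], h["text"]) for ht, h in zip(heights, headings)]
```

The resulting level can be mapped to a Markdown prefix with `"#" * level`. A fixed tolerance is a simplification; scanned documents with noisy bboxes would likely need a more robust grouping.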