drunkpig

Results 91 comments of drunkpig

@TapXWorld Models are not quite like people. The group owner can take a look at this well-known LLM training dataset: https://huggingface.co/datasets/tiiuae/falcon-refinedweb/viewer/default/train?row=0

@yiyibooks Thanks for your enthusiasm. As you can see, code blocks, lists, and content lists are not yet recognized by the layout recognition model. The development of this feature is...

@WillingLau @Tendo33 we'll publish a new release, 0.7.0, and the documentation will be OK then.

> @WillingLau @Tendo33 we'll publish a new release, 0.7.0, and the documentation will be OK then.

Next week, around the first week of August.

@WillingLau @Tendo33 please refer to https://github.com/opendatalab/MinerU/blob/master/README_zh-CN_v2.md; we'll release it later.

> https://github.com/opendatalab/magic-doc
>
> it will work with ppt/pptx files

If you want a high-quality extraction result, you should convert the ppt to pdf and then use MinerU. If you want fast...

@chuanbei888 try converting the ppt to pdf with LibreOffice

`libreoffice --invisible --convert-to docx:'MS Word 2007 XML' /path/to/mydoc.doc --outdir /output/dir`
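The command above converts a .doc file to .docx. For the ppt-to-pdf case, a minimal Python sketch might look like the following. It assumes a `libreoffice` binary on the PATH (on some systems it is called `soffice`) and uses the `--headless`, `--convert-to`, and `--outdir` options; the function names and file paths are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, binary="libreoffice"):
    """Build the argument list for a headless ppt/pptx -> pdf conversion."""
    return [
        binary,
        "--headless",            # run without opening a GUI window
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_to_pdf(src, out_dir, binary="libreoffice"):
    """Run the conversion and return the path where the PDF is expected."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    subprocess.run(build_convert_command(src, out_dir, binary), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(src).stem + ".pdf")


if __name__ == "__main__":
    # Hypothetical paths; only prints the command, does not run it.
    print(" ".join(build_convert_command("/path/to/slides.pptx", "/output/dir")))
```

The resulting PDF can then be passed to MinerU as usual.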

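@jefferyvvv Due to limited manpower, first-level headings are not implemented yet. The approach would be as follows: once headings are recognized, the height of each heading's bbox is known, so you can cluster the headings by height and then sort the clusters.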
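A rough sketch of that idea, not MinerU code: it assumes each recognized heading comes as a dict with a `text` field and a `bbox` of `(x0, y0, x1, y1)`, and it uses a simple greedy 1-D grouping with a hypothetical `tolerance` parameter in place of a real clustering algorithm. The tallest group becomes level 1.

```python
def assign_heading_levels(headings, tolerance=1.5):
    """Assign a level to each heading from its bbox height.

    headings: list of dicts with "text" and "bbox" = (x0, y0, x1, y1).
    Returns a list of (text, level) in the original order; level 1 is
    the tallest group of headings.
    """
    heights = [h["bbox"][3] - h["bbox"][1] for h in headings]

    # Walk the distinct heights from tallest to shortest and start a new
    # group whenever a height falls more than `tolerance` below the
    # tallest member of the current group.
    level_of = {}
    level = 0
    anchor = None
    for height in sorted(set(heights), reverse=True):
        if anchor is None or anchor - height > tolerance:
            level += 1
            anchor = height
        level_of[height] = level

    return [(h["text"], level_of[ht]) for h, ht in zip(headings, heights)]


if __name__ == "__main__":
    # Made-up sample data for illustration.
    sample = [
        {"text": "1 Introduction", "bbox": (50, 100, 300, 124)},
        {"text": "1.1 Background", "bbox": (50, 200, 260, 218)},
        {"text": "2 Method", "bbox": (50, 400, 200, 424.5)},
        {"text": "2.1 Data", "bbox": (50, 500, 180, 517)},
    ]
    for text, level in assign_heading_levels(sample):
        print(level, text)
```

Note that a heading wrapped over two lines would have roughly double the bbox height, so a real implementation would probably need to normalize by line count first.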