Hello MinerU team, I'm very interested in the quality-inspection layer. Do you have any plans for it?

Jul 26 '24 08:07 xuquankun

Regarding the quality assurance of large-scale mixed PDFs, we proceed as follows:

From the PDF pages we have already processed, manually select pages across several categories (such as papers, textbooks, exam papers, and research reports, though not limited to these), keeping only the pages that were processed well. The more diverse, the better.
Convert these PDF pages into images.
Extract features from these images (using ResNet50) and store them in a feature library.
When a PDF page of unknown quality is processed, use a screenshot of that page to query the feature library for the most similar PDF pages, and record the distance.
Once a batch of PDFs has been processed, we filter out the pages whose distances rank in the bottom N% and analyze them one by one, either to improve our model or to drop the data.
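
The query-and-filter steps above can be sketched roughly as follows. This is only an illustration, not the team's actual scripts. It assumes the feature vectors have already been extracted (toy 2-D vectors stand in for ResNet50 features), it assumes Euclidean distance, and it reads "bottom N%" as the pages least similar to the library, i.e. those with the largest nearest-neighbor distance. The names `nearest_distance` and `flag_for_review` are hypothetical.

```python
import math


def nearest_distance(query, library):
    """Distance from one page's feature vector to its closest match in the library."""
    # Euclidean distance is an assumption; the thread does not name a metric.
    return min(math.dist(query, ref) for ref in library)


def flag_for_review(page_distances, percent):
    """Return page ids whose distance is among the largest `percent`% of the batch."""
    if not page_distances:
        return []
    # Sort page ids from least similar (largest distance) to most similar.
    ranked = sorted(page_distances, key=page_distances.get, reverse=True)
    count = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:count]


# Feature library built from hand-picked, well-processed pages (toy vectors).
library = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Feature vectors of pages whose quality is unknown (toy vectors).
batch = {
    "page_a": [0.1, 0.0],
    "page_b": [0.9, 0.1],
    "page_c": [5.0, 5.0],
    "page_d": [0.0, 1.2],
}

distances = {page_id: nearest_distance(vec, library) for page_id, vec in batch.items()}
flagged = flag_for_review(distances, percent=25)
print(flagged)  # ['page_c']
```

Here `page_c` lies far from every reference page, so it is the one picked for manual review. With a real feature library, a linear scan like this would likely be replaced by a vector index.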

@xuquankun

Jul 26 '24 08:07 drunkpig

I understand the workflow for this part of your work now. Do you plan to develop and open-source the quality-inspection tool?

Jul 26 '24 08:07 xuquankun

@xuquankun This functionality relies on continuously accumulating data, and the tools are not fully developed. Some steps are completed by different team members using various scripts, so it is not suitable for open sourcing.

Jul 26 '24 09:07 drunkpig

Understood, thank you.

Jul 26 '24 09:07 xuquankun