enh: marker integration for better pdf parsing
What about docling support?
Seems interesting,
maybe https://github.com/drmingler/docling-api is also worth looking at.
MinerU is quite promising.
I tried both Docling and MinerU. Unfortunately, Docling does not currently support formula recognition. However, the overall experience with Docling is quite smooth.
MinerU provides the best results and can even recognize formulas. Additionally, its table parsing and layout detection are much better.
Below is a markdown generated by MinerU. As you can see, even though this formula is very complex, it recognized it perfectly. 🤩
Under the hood, it uses
- fine-tuned YOLOv8 for formula detection
- UniMERNet for formula recognition
However, the setup is ~~complex~~ (it is easy now). I provided a guide at https://github.com/opendatalab/MinerU/discussions/1374 if you want to try MinerU ☺️
There is also a recent one from Microsoft: MarkItDown. However, it currently lacks support for formulas as well: https://github.com/microsoft/markitdown/issues/17
Perhaps a good solution would be to design a generic interface. Since the general settings already have a "Content Extraction Engine" parameter, which is set to "default|tika", perhaps we could have a generic interface for plugging in different content extraction servers?
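To make the idea concrete, here is a minimal sketch of what such a generic interface could look like. Everything in it is hypothetical: the names `ContentExtractor`, `ExtractedDocument`, `register` and `get_extractor` are made up for illustration and are not Open WebUI's actual code. The remote engine deliberately leaves the HTTP call unimplemented, since each server (Tika, Docling, Marker, MinerU) has its own API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ExtractedDocument:
    """Result returned by every engine, whatever the back end."""
    text: str
    metadata: dict = field(default_factory=dict)


class ContentExtractor(ABC):
    """Hypothetical common interface for content extraction engines."""

    @abstractmethod
    def extract(self, data: bytes, filename: str) -> ExtractedDocument:
        ...


class PlainTextExtractor(ContentExtractor):
    """Stand-in for a "default" engine: decodes the bytes as UTF-8."""

    def extract(self, data: bytes, filename: str) -> ExtractedDocument:
        text = data.decode("utf-8", errors="replace")
        return ExtractedDocument(text, {"engine": "default", "source": filename})


class RemoteExtractor(ContentExtractor):
    """Placeholder for a server-backed engine with a configurable URL."""

    def __init__(self, name: str, base_url: str):
        self.name = name
        self.base_url = base_url.rstrip("/")  # user-supplied, not hard-coded

    def extract(self, data: bytes, filename: str) -> ExtractedDocument:
        # Assumption: the request format is server-specific, so it is
        # intentionally left out of this sketch.
        raise NotImplementedError(f"HTTP call to {self.base_url} goes here")


# Registry mapping the engine name from the settings to a factory.
_REGISTRY: Dict[str, Callable[..., ContentExtractor]] = {}


def register(name: str, factory: Callable[..., ContentExtractor]) -> None:
    _REGISTRY[name] = factory


def get_extractor(name: str, **kwargs) -> ContentExtractor:
    if name not in _REGISTRY:
        raise KeyError(f"unknown content extraction engine: {name}")
    return _REGISTRY[name](**kwargs)


register("default", lambda: PlainTextExtractor())
register("tika", lambda base_url: RemoteExtractor("tika", base_url))
```

With something like this, adding a new engine would only mean registering one more factory, and the rest of the RAG pipeline would keep calling `get_extractor(name).extract(...)`.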
Integrating an enhanced PDF extractor that can handle scientific papers would have a big impact for us, as we have many PhD students and scientists.
Some independent people in our laboratory ran tests that roughly compared openwebui/mistralsmall with adobe/reader RAG. They produced a bad report on Open WebUI and tried to convince others not to use it. This was mainly because they did not know the architecture of Open WebUI and missed the fact that RAG in Open WebUI depends on the first module, "tika", as the OCR extraction tool.
To keep people using Open WebUI for scientific topics, we need some advice on how to configure Open WebUI to get good results.
- Is there a difference between Chroma and Milvus?
- Do you plan to add more documentation on this topic?
- Do you plan to integrate another OCR tool such as Docling? If so, what is your potential roadmap?
Thanks for sharing your expertise and vision.
For local experimental purposes, it takes minimal changes to use docling as a content extraction engine, see #9238. Unfortunately, I will not have time in the coming days or weeks to make it ready for a release. If you have a few hours or more, please feel free to build upon this PR for docling integration.
@tjbck will it be possible to allow custom URLs for those who want to try the self-hosted version? It seems the address of the provider's endpoint is hard-coded.
I built this one as an external tool: https://github.com/CodeAtCode/deadsimple. It could probably be improved to use other tools instead of markitdown.
It seems that there is an option to use the hosted Marker API. However, I would like to be able to change the URL so that I can use a self-hosted Marker API. Please consider this.
This would be a great option, and I think not a lot of code would be needed to change the address for the API calls. It would be a great addition, as I've found the standard text extraction, and even Docling with EasyOCR and RapidOCR, to be very underwhelming when it comes to formula recognition. In a lot of scientific and engineering documents these formulas are essential, and marker (with the right settings!) rarely lets you down (only minor markup issues).
MinerU now supports API mode, so it's a good time to integrate it.
Keep in mind, though, that using YOLOv8 in commercial contexts requires a license: https://www.ultralytics.com/license