open-webui

enh: marker integration for better pdf parsing

Open tjbck opened this issue 1 year ago • 7 comments

tjbck avatar Nov 11 '24 02:11 tjbck

What about docling support?

sir3mat avatar Nov 12 '24 11:11 sir3mat

What about docling support?

Seems interesting,

maybe https://github.com/drmingler/docling-api is also worth looking at.

jannikstdl avatar Nov 13 '24 08:11 jannikstdl

MinerU is quite promising.

I tried both Docling and MinerU. Unfortunately, Docling does not currently support formula recognition. However, the overall experience with Docling is quite smooth.

MinerU provides the best results and can even recognize formulas. Additionally, its table parsing and layout detection are much better.

Below is a markdown generated by MinerU. As you can see, even though this formula is very complex, it recognized it perfectly. 🤩

screenshot

Under the hood, it uses

  • fine-tuned YOLOv8 for formula detection
  • UniMERNet for formula recognition

However, the set up is ~~complex~~ (it is easy now), I provided a guide at https://github.com/opendatalab/MinerU/discussions/1374 if you want to try MinerU ☺️

hongbo-miao avatar Dec 28 '24 02:12 hongbo-miao

There is also a recent one from Microsoft: MarkItDown. However, it currently lacks support for formulas as well: https://github.com/microsoft/markitdown/issues/17

hongbo-miao avatar Jan 06 '25 23:01 hongbo-miao

Perhaps a good solution would be to design a generic interface. Since the general settings already have a "Content Extraction Engine" parameter that can be set to "default" or "tika", perhaps we could have a generic interface to plug in different content extraction servers?
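To make the idea of a generic interface concrete, here is a minimal sketch in Python. It is not Open WebUI code: `ContentExtractor`, `PlainTextExtractor`, `RemoteExtractor`, `register_engine` and `get_engine` are all hypothetical names, and the remote engine takes an injected `send` function instead of assuming any particular server API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class ContentExtractor(Protocol):
    """Hypothetical interface: anything that turns file bytes into text."""

    def extract(self, data: bytes, filename: str) -> str: ...


@dataclass
class PlainTextExtractor:
    """Stand-in for a built-in engine: decodes the bytes as UTF-8."""

    def extract(self, data: bytes, filename: str) -> str:
        return data.decode("utf-8", errors="replace")


@dataclass
class RemoteExtractor:
    """Stand-in for a server-backed engine (Tika, Docling, MinerU, ...).

    `send` is injected by the caller, so this sketch does not assume
    any real server's request or response format.
    """

    base_url: str
    send: Callable[[str, bytes, str], str]

    def extract(self, data: bytes, filename: str) -> str:
        return self.send(self.base_url, data, filename)


# Engine name (as it would appear in the settings) -> factory.
_REGISTRY: Dict[str, Callable[[], ContentExtractor]] = {}


def register_engine(name: str, factory: Callable[[], ContentExtractor]) -> None:
    """Make an engine selectable under the given name."""
    _REGISTRY[name] = factory


def get_engine(name: str) -> ContentExtractor:
    """Look up an engine by name, failing loudly on unknown names."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"Unknown content extraction engine: {name!r}") from None


register_engine("default", PlainTextExtractor)
```

With something like this, adding a new extraction server would mean registering one more factory rather than touching the code that calls `extract`.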

Integrating an enhanced PDF extractor that can handle scientific papers would have a big impact for us, as we have many PhD students and scientists.

flefevre avatar Jan 25 '25 08:01 flefevre

Some independent people in our laboratory ran tests comparing openwebui/mistralsmall with the adobe/reader RAG. They produced a negative report on Open WebUI and tried to convince others not to use it. This was mainly because they did not know the architecture of Open WebUI and missed the fact that its RAG depends on the first module, "tika", as the OCR extraction tool.

To keep people using Open WebUI for scientific topics, we need some advice on how to configure it to get good results.

  • Is there a difference between Chroma and Milvus?
  • Do you plan to add more documentation on this topic?
  • Do you plan to integrate another OCR tool such as Docling? If so, what is your potential roadmap?

Thanks for sharing your expertise and vision.

flefevre avatar Jan 30 '25 09:01 flefevre

For local experimental purposes, it takes minimal changes to use docling as a content extraction engine; see #9238. Unfortunately, I will not have time in the coming days or weeks to make it ready for a release. If you have a few hours or more, please feel free to build upon this PR for docling integration.

MichaelKarpe avatar Feb 02 '25 13:02 MichaelKarpe

@tjbck will it be possible to allow custom URLs for those who want to try the self-hosted version? It seems the address of the provider's endpoint is hard-coded.

oatmealm avatar May 30 '25 06:05 oatmealm

I built this one as an external tool: https://github.com/CodeAtCode/deadsimple. It could probably be improved to use other tools instead of markitdown.

Mte90 avatar Jun 27 '25 11:06 Mte90

It seems that there is an option to use the hosted Marker API. However, I would like to be able to change the URL so that I can use a self-hosted Marker API. Please consider this.

vojtapolasek avatar Jul 09 '25 12:07 vojtapolasek

It seems that there is an option to use hosted Marker API. However, I would like to be able to change the URL so that I can use selfhosted marker API. Please consider this.

This would be a great option, and I don't think a lot of code would be involved in changing the address for API calls. It would be a great addition, as I've found the standard text extraction, and even Docling with easyOCR and RapidOCR, to be very underwhelming when it comes to formula recognition. In a lot of scientific and engineering documents these formulas are essential, and marker (with the right settings!) rarely lets you down (only minor markup issues).
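As a rough illustration of how small such a change could be, here is a sketch that resolves the API address from an environment variable and falls back to a default. Everything in it is an assumption made for illustration: the variable name `MARKER_API_BASE_URL`, the placeholder default URL, and the `marker_endpoint` helper are not taken from Open WebUI or Marker.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urljoin

# Hypothetical names: neither the variable nor the default URL is a
# real Open WebUI or Marker setting.
ENV_VAR = "MARKER_API_BASE_URL"
DEFAULT_BASE_URL = "https://marker.example.invalid/api/"


def marker_endpoint(path: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Build the full URL for an API path, honoring a user override."""
    env = os.environ if env is None else env
    # An unset or blank override falls back to the default.
    base = env.get(ENV_VAR, "").strip() or DEFAULT_BASE_URL
    # urljoin drops the last path segment unless the base ends with "/".
    if not base.endswith("/"):
        base += "/"
    return urljoin(base, path.lstrip("/"))
```

A self-hosted user would then set the variable to their own server address, and every call that goes through `marker_endpoint` would pick it up.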

HenkieTenkie62 avatar Jul 16 '25 07:07 HenkieTenkie62

MinerU is quite promising.

I tried both Docling and MinerU. Unfortunately, Docling does not currently support formula recognition. However, the overall experience with Docling is quite smooth.

MinerU provides the best results and can even recognize formulas. Additionally, its table parsing and layout detection are much better.

Below is a markdown generated by MinerU. As you can see, even though this formula is very complex, it recognized it perfectly. 🤩

screenshot

Under the hood, it uses

  • fine-tuned YOLOv8 for formula detection
  • UniMERNet for formula recognition

However, the set up is ~complex~ (it is easy now), I provided a guide at opendatalab/MinerU#1374 if you want to try MinerU ☺️

MinerU now supports API mode, so it's a good time to integrate it.

homjay avatar Aug 30 '25 10:08 homjay

Keep in mind, though, that using YOLOv8 in commercial contexts requires a license: https://www.ultralytics.com/license

schmik avatar Sep 01 '25 07:09 schmik