Can it process images in pdf?

Open biandh opened this issue 1 year ago • 14 comments

biandh avatar Dec 01 '23 06:12 biandh

It will skip the images currently. It's possible to save the images separately and embed them into the markdown, though - what's your use case for the images?

VikParuchuri avatar Dec 01 '23 06:12 VikParuchuri

I also think it would be better to process images and tables, like Mathpix does.

DamonsJ avatar Dec 01 '23 06:12 DamonsJ

It will extract tables, just not images. Can you tell me more about what you're using the images for?

VikParuchuri avatar Dec 01 '23 06:12 VikParuchuri

Sometimes you need the images to illustrate the points made in the PDF. For example, if you are working on computer graphics, papers contain many graphs and images, and you want to collect them in your own notes.

DamonsJ avatar Dec 01 '23 06:12 DamonsJ

I would find it very helpful to have images extracted so that I can convert a homework assignment pdf into a complete markdown file. How would you recommend extracting images?

kshitijsachan avatar Dec 01 '23 16:12 kshitijsachan

This could be a good issue to work on!

mahimairaja avatar Dec 02 '23 06:12 mahimairaja

It will skip the images currently. It's possible to save the images separately and embed them into the markdown, though - what's your use case for the images?

For me, I would use Obsidian to read books and record my comments, so I would like the images included for better reading. Maybe the final file layout could be:

-Book name
--book name.md
--image
---image 1
---image 2
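
The folder layout described above could be produced with a few lines of `pathlib`. This is only a sketch under assumptions: `write_note` is a hypothetical helper, not part of marker, and it naively appends the image links at the end of the note instead of placing them where the figures appeared.

```python
from pathlib import Path


def write_note(root: Path, book: str, markdown: str, images: dict[str, bytes]) -> Path:
    """Write <root>/<book>/<book>.md plus an image/ subfolder next to it."""
    book_dir = root / book
    image_dir = book_dir / "image"
    image_dir.mkdir(parents=True, exist_ok=True)

    # Save each image and build a relative link so the folder stays portable.
    links = []
    for name, data in images.items():
        (image_dir / name).write_bytes(data)
        links.append(f"![{name}](image/{name})")

    # Hypothetical choice: links go at the end of the note, not inline.
    note = book_dir / f"{book}.md"
    note.write_text(markdown + "\n\n" + "\n".join(links) + "\n", encoding="utf-8")
    return note
```

Relative links such as `image/image_1.png` keep the note readable in Obsidian even if the whole book folder is moved. File names without spaces are safer, since spaces in a markdown link target need escaping.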

keno-log avatar Dec 03 '23 15:12 keno-log

We really need this feature. 🔥 Does anyone know of any alternatives that could replace this project?

Hambaobao avatar Dec 12 '23 08:12 Hambaobao

I just tried GroundingDINO to draw bounding boxes around the figures so I could then extract them. Bad results. I wonder how far you could get by fine-tuning GroundingDINO for this task.

yachty66 avatar Dec 12 '23 17:12 yachty66

It is possible to use fitz/PyMuPDF to extract the images on each page (just not at their exact positions, as you can with docx files), save them to a folder, and reference them in the markdown using the save path.
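
A minimal sketch of that idea, assuming PyMuPDF is installed. The function names and the `image/` folder are hypothetical, and this is not how marker itself handles images. `page.get_images` only lists embedded raster images, so figures drawn as vector graphics will not show up, and the links come back in page order with no position inside the page.

```python
from pathlib import Path


def image_link(image_path: Path, note_dir: Path) -> str:
    """Return a markdown image link relative to the note's folder."""
    rel = image_path.relative_to(note_dir).as_posix()
    return f"![{image_path.stem}]({rel})"


def extract_images(pdf_path: str, note_dir: str) -> list[str]:
    """Save every embedded image in the PDF and return markdown links in page order."""
    import fitz  # PyMuPDF; imported here so image_link works without it installed

    out = Path(note_dir)
    image_dir = out / "image"  # hypothetical output folder
    image_dir.mkdir(parents=True, exist_ok=True)

    links = []
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            # Each entry describes one embedded image; its first item is the xref.
            for img_index, img in enumerate(page.get_images(full=True)):
                info = doc.extract_image(img[0])  # dict with "image" bytes and "ext"
                name = f"page{page_index + 1}_img{img_index + 1}.{info['ext']}"
                target = image_dir / name
                target.write_bytes(info["image"])
                links.append(image_link(target, out))
    return links
```

Calling `extract_images("paper.pdf", "notes")` would write the files under `notes/image/` and return links that can be appended to the converted markdown.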

tungsten106 avatar Dec 13 '23 03:12 tungsten106

If anyone wants to contribute this with a PR, I'd be very excited to review. I'm working on improving some of the base models and making marker fully open (it's noncommercial right now due to nougat and layoutlmv3 licensing), so I don't have bandwidth to take the image project on at the moment.

The segmentation model identifies image positions, so it may be possible to extract images using that, and embed them in the right spot.
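
As a rough illustration of that approach, the sketch below crops a detected region out of a page with PyMuPDF. It rests on assumptions that the thread does not confirm: that the layout model returns boxes as `(x0, y0, x1, y1)` in pixels of a page rendered at a known DPI with a top-left origin, and that the page is unrotated. Both function names are hypothetical.

```python
def pixels_to_points(bbox, dpi):
    """Convert an (x0, y0, x1, y1) box from rendered pixels to PDF points (72 per inch)."""
    scale = 72.0 / dpi
    return tuple(round(v * scale, 2) for v in bbox)


def crop_region(pdf_path, page_number, bbox_px, dpi, out_path):
    """Render only the given region of one page and save it as an image file."""
    import fitz  # PyMuPDF; imported here so pixels_to_points works without it installed

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]  # zero-based page index
        clip = fitz.Rect(*pixels_to_points(bbox_px, dpi))
        zoom = dpi / 72.0  # render the crop at the same resolution the model saw
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=clip)
        pix.save(out_path)
```

Rendering the clipped region instead of pulling out embedded image objects also captures vector figures, at the cost of rasterizing them. Because the box is known, the saved file can be linked at the matching spot in the markdown.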

VikParuchuri avatar Dec 13 '23 03:12 VikParuchuri

It is possible to use fitz/PyMuPDF to extract the images on each page (just not at their exact positions, as you can with docx files), save them to a folder, and reference them in the markdown using the save path.

@tungsten106 No, not really, or at least the results aren't good: fitz/PyMuPDF doesn't find all the images.

yachty66 avatar Dec 13 '23 21:12 yachty66

Yes please. My use case: pdf => markdown => html => png images => canvas

7flash avatar Feb 27 '24 21:02 7flash

I'm training a model to extract images - this will be integrated into marker

VikParuchuri avatar Feb 28 '24 06:02 VikParuchuri

Image extraction will be coming in the next version (should be shipped in the next 2 weeks).

VikParuchuri avatar May 03 '24 05:05 VikParuchuri

Just added this into the dev branch - https://github.com/VikParuchuri/marker/pull/111 . I'm going to close this issue, since the feature will land in master soon (next few days).

VikParuchuri avatar May 07 '24 18:05 VikParuchuri