marker

Add image label to output MD file.

Open tungsten106 opened this issue 1 year ago • 7 comments

  • Uses the PyMuPDF package to extract image bboxes, sorts them by y-position, and adds a Markdown-formatted image label as text to the output Markdown file;
  • Image data is saved in the metadata.json file under the key "image" as a Dict of the form {img_path: img_byte_content}; each image can then be saved to its path by the file convert_single.py.
  • Not all pictures in a PDF can be identified (for example, the image on page 2 of Multi-column CNN), as noted by @yachty66. Technically that one is not a picture: it is a figure built from text boxes, arrows, etc. I am unsure how to resolve this at the moment. Hope this helps :)
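For readers who want a concrete picture of the approach in the list above, here is a minimal sketch, not the code from this PR. It assumes text blocks are already available as (bbox, text) pairs; the helper names (`extract_page_images`, `merge_by_y`, `save_images`) and the image file naming scheme are hypothetical. Only the PyMuPDF calls `fitz.open`, `page.get_images`, `page.get_image_rects`, and `doc.extract_image` come from the library.

```python
from pathlib import Path
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def extract_page_images(pdf_path: str, page_index: int) -> List[Tuple[BBox, str, bytes]]:
    """Return (bbox, file_name, image_bytes) for each embedded image on one page."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    found = []
    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        for img in page.get_images(full=True):
            xref = img[0]  # first tuple entry is the image's xref number
            info = doc.extract_image(xref)  # dict with "image" (bytes) and "ext"
            for rect in page.get_image_rects(xref):
                # Hypothetical naming scheme, chosen only for this sketch.
                name = f"page{page_index}_img{xref}.{info['ext']}"
                found.append(((rect.x0, rect.y0, rect.x1, rect.y1), name, info["image"]))
    finally:
        doc.close()
    return found


def merge_by_y(text_blocks: List[Tuple[BBox, str]],
               images: List[Tuple[BBox, str, bytes]]) -> str:
    """Interleave text and Markdown image labels by top y-coordinate."""
    items = [(bbox[1], text) for bbox, text in text_blocks]
    items += [(bbox[1], f"![{name}]({name})") for bbox, name, _ in images]
    items.sort(key=lambda item: item[0])  # stable sort: ties keep input order
    return "\n\n".join(text for _, text in items)


def save_images(images: Dict[str, bytes], out_dir: str) -> None:
    """Write a {img_path: img_byte_content} mapping to files under out_dir."""
    for rel_path, content in images.items():
        target = Path(out_dir) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
```

Sorting on the top edge alone is the simplest reading of "sorted with y-position"; a multi-column layout would likely need column detection before the merge.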

tungsten106 avatar Dec 27 '23 14:12 tungsten106

@tungsten106 Thanks so much for this! It was on my list of functionality to add soon. I'll take a look next week (after the holiday).

VikParuchuri avatar Dec 28 '23 02:12 VikParuchuri

@tungsten106 I'd love to review this, but the diffs seem to have issues (entire file is shown as deleted, with all the lines also shown as added). I'm having a hard time seeing what was changed. Do you know why this is happening with the diffs?

VikParuchuri avatar Jan 02 '24 19:01 VikParuchuri

@tungsten106 I'd love to review this, but the diffs seem to have issues (entire file is shown as deleted, with all the lines also shown as added). I'm having a hard time seeing what was changed. Do you know why this is happening with the diffs?

It is probably a problem caused by the end-of-line sequence setting in VS Code on Windows. I have changed the setting from CRLF back to LF, and the diff should work now.
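As an illustration of the fix described above, here is a small sketch that converts CRLF line endings to LF at the byte level. The function names are hypothetical and this is not part of the PR; an editor setting or Git's own line-ending configuration would achieve the same result.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to Unix (LF); lone LF bytes are untouched."""
    return data.replace(b"\r\n", b"\n")


def normalize_file(path: str) -> bool:
    """Rewrite the file with LF endings; return True if anything changed."""
    p = Path(path)
    original = p.read_bytes()
    converted = crlf_to_lf(original)
    if converted != original:
        p.write_bytes(converted)
        return True
    return False
```

When every line ending in a file changes, a line-based diff reports each line as deleted and re-added, which matches the symptom reported above.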

tungsten106 avatar Jan 03 '24 16:01 tungsten106

Following to know when this is implemented. With GPT-4V out, the focus is on multimodal retrieval systems. Since marker outperforms most PDF readers, adding images would make it very valuable for general-purpose PDF loading in such systems.

OmriNach avatar Jan 04 '24 14:01 OmriNach

Not all pictures in a PDF can be identified (for example, the image on page 2 of Multi-column CNN), as noted by @yachty66. Technically that one is not a picture: it is a figure built from text boxes, arrows, etc. I am unsure how to resolve this at the moment.

Why can't we do something like get the bounding box, screenshot that part of the page, and add it?
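A rough sketch of that idea, assuming the figure's bounding box is already known (for example, from a layout detector, which is not shown here). The helper names `padded_clip` and `render_region` are hypothetical; `page.get_pixmap` with `matrix` and `clip` arguments, and `pix.save`, are PyMuPDF calls.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def padded_clip(bbox: BBox, page_size: Tuple[float, float], pad: float = 4.0) -> BBox:
    """Grow the box by `pad` points on each side, clamped to the page bounds."""
    x0, y0, x1, y1 = bbox
    width, height = page_size
    return (max(0.0, x0 - pad), max(0.0, y0 - pad),
            min(width, x1 + pad), min(height, y1 + pad))


def render_region(pdf_path: str, page_index: int, bbox: BBox,
                  out_path: str, zoom: float = 2.0) -> None:
    """Rasterize one region of a page to an image file (a "screenshot")."""
    import fitz  # PyMuPDF; imported lazily so padded_clip works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        clip = fitz.Rect(*padded_clip(bbox, (page.rect.width, page.rect.height)))
        # zoom=2.0 renders at twice the default 72 dpi resolution
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=clip)
        pix.save(out_path)
    finally:
        doc.close()
```

Because the region is rendered rather than extracted, this would also capture figures made of text boxes and arrows, at the cost of storing a raster image instead of the original vector content.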

morizin avatar Jan 24 '24 13:01 morizin

After images are added, it would be great to add a translation function to the project, plus the ability to right-click an image and have GPT-4-vision answer questions about it. That would make a great essay tool.

CBIhalsen avatar Jan 29 '24 19:01 CBIhalsen

Is the image extraction feature included in the latest code? As of today, I cloned the master branch (as there is no release) and ran it, but I couldn't get any images in the output .md file. I thought the MD file would have images embedded in it, but I didn't find any. Should I set a variable to extract images and embed them into the output MD file?

Is this feature upcoming?

Also, is there any way I can run this on Hugging Face and deploy it there? Could you create something similar, some remote solution?

catalystK avatar Mar 12 '24 17:03 catalystK