Extract mathematical formulas from PDF files.

Open ggservice007 opened this issue 11 months ago • 2 comments

what

Extract mathematical formulas from PDF files.

For example:

[image]

Extract the formula and save it as LaTeX code.

ggservice007 avatar Mar 14 '24 08:03 ggservice007

pdfimages can list the images embedded in a PDF file with the command pdfimages -list <path of pdf file> (running it without -list and with an output prefix extracts them):

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     164   164  index   1   8  image  no        18  0   197   197  647B 2.4%
   1     1 smask     164   164  gray    1   8  image  no        18  0   197   197   51B 0.2%
   1     2 image     893   550  index   1   8  image  no        19  0   150   150 1601B 0.3%
   1     3 smask     893   550  gray    1   8  image  no        19  0   150   150  545B 0.1%
   1     4 image     166    43  icc     3   8  image  no        20  0   151   152 9226B  43%
   1     5 smask     166    43  gray    1   8  image  no        20  0   151   152   32B 0.4%
   1     6 image     183   254  icc     3   8  jpeg   no        21  0   220   220 10.1K 7.4%
   3     7 image     615   579  rgb     3   8  jpx    yes       64  0   220   220 35.7K 3.4%
   4     8 image     606   589  rgb     3   8  jpx    yes       69  0   220   220 25.4K 2.4%
   7     9 image     606   672  rgb     3   8  jpx    yes       82  0   220   220 40.0K 3.4%

page: The page number on which the image appears in the PDF file.
num: The image number, counted sequentially across the whole document.
type: The type of image, such as "image" or "smask" (soft mask).
width: The width of the image in pixels.
height: The height of the image in pixels.
color: The color space of the image, such as "index" (indexed color), "gray" (grayscale), "rgb", or "icc" (ICC-based).
comp: The number of color components in the image.
bpc: The number of bits per color component.
enc: The encoding of the image, such as "image" (raster data with no image-specific encoding), "jpeg", or "jpx".
interp: Whether interpolation is requested when the image is scaled up.
object ID: The object number and generation number of the image in the PDF file (two columns).
x-ppi: The horizontal resolution of the image as placed on the page, in pixels per inch (PPI).
y-ppi: The vertical resolution of the image as placed on the page, in pixels per inch (PPI).
size: The size of the embedded image data in the PDF file.
ratio: The size of the embedded image data as a percentage of its uncompressed size.
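As a rough sketch of how the listing above could be consumed programmatically, the following Python parses the text printed by pdfimages -list into records and filters out the soft masks. The names PdfImage, parse_pdfimages_list, and list_pdf_images are made up for this illustration, and the parser assumes the 16-column layout shown above; other pdfimages versions may print different columns.

```python
import subprocess
from typing import List, NamedTuple


class PdfImage(NamedTuple):
    """A subset of the columns printed by `pdfimages -list` (hypothetical record type)."""
    page: int
    num: int
    type: str
    width: int
    height: int
    enc: str


def parse_pdfimages_list(listing: str) -> List[PdfImage]:
    """Parse the text output of `pdfimages -list` into PdfImage records."""
    images = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have 16 whitespace-separated fields and start with a page
        # number; this skips the header row and the dashed separator.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        images.append(PdfImage(
            page=int(fields[0]),
            num=int(fields[1]),
            type=fields[2],
            width=int(fields[3]),
            height=int(fields[4]),
            enc=fields[8],
        ))
    return images


def list_pdf_images(pdf_path: str) -> List[PdfImage]:
    """Run `pdfimages -list` (poppler-utils must be installed) and parse its output."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return parse_pdfimages_list(result.stdout)


# Three rows copied from the listing above, used here instead of a real PDF.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     164   164  index   1   8  image  no        18  0   197   197  647B 2.4%
   1     1 smask     164   164  gray    1   8  image  no        18  0   197   197   51B 0.2%
   3     7 image     615   579  rgb     3   8  jpx    yes       64  0   220   220 35.7K 3.4%
"""

# Keep only real pictures; soft masks are transparency data, not content.
pictures = [img for img in parse_pdfimages_list(SAMPLE) if img.type == "image"]
```

Calling list_pdf_images("<path of pdf file>") would do the same against a real file. Note that this only finds formulas that are embedded as raster images.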

nkwangleiGIT avatar Mar 15 '24 03:03 nkwangleiGIT

For LaTeX OCR, refer to https://github.com/lukas-blecher/LaTeX-OCR
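A minimal sketch of feeding one extracted image to that project from Python, assuming its pix2tex package and Pillow are installed and that the image is already cropped to a single formula. The LatexOCR class and its call on a PIL image follow the usage shown in the project's README; image_to_latex and as_display_math are hypothetical helper names introduced here.

```python
def image_to_latex(image_path: str) -> str:
    """Return the LaTeX predicted by LaTeX-OCR for one formula image."""
    # Imported lazily because both are third-party packages
    # (assumption: installed via `pip install pix2tex pillow`).
    from PIL import Image
    from pix2tex.cli import LatexOCR

    model = LatexOCR()  # loads the model weights; may download them on first use
    return model(Image.open(image_path))


def as_display_math(latex: str) -> str:
    """Wrap a LaTeX snippet in \\[ ... \\] so it can be written to a .tex file."""
    return "\\[\n" + latex.strip() + "\n\\]\n"
```

For example, as_display_math(image_to_latex("formula.png")) would give text ready to save, where formula.png is a placeholder path. The predicted LaTeX should be checked by hand, since OCR output can be wrong.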

nkwangleiGIT avatar Mar 15 '24 03:03 nkwangleiGIT