arcadia
Extract mathematical formulas from PDF files.
what
Extract mathematical formulas from PDF files.
for example
Extract each formula and save it as LaTeX code.
pdfimages can list the images embedded in a PDF file with the command pdfimages -list <path of pdf file>:
page   num  type   width height color comp bpc  enc   interp  object ID x-ppi y-ppi  size ratio
-----------------------------------------------------------------------------------------------
   1     0 image     164    164 index    1   8  image no          18  0   197   197  647B  2.4%
   1     1 smask     164    164 gray     1   8  image no          18  0   197   197   51B  0.2%
   1     2 image     893    550 index    1   8  image no          19  0   150   150 1601B  0.3%
   1     3 smask     893    550 gray     1   8  image no          19  0   150   150  545B  0.1%
   1     4 image     166     43 icc      3   8  image no          20  0   151   152 9226B   43%
   1     5 smask     166     43 gray     1   8  image no          20  0   151   152   32B  0.4%
   1     6 image     183    254 icc      3   8  jpeg  no          21  0   220   220 10.1K  7.4%
   3     7 image     615    579 rgb      3   8  jpx   yes         64  0   220   220 35.7K  3.4%
   4     8 image     606    589 rgb      3   8  jpx   yes         69  0   220   220 25.4K  2.4%
   7     9 image     606    672 rgb      3   8  jpx   yes         82  0   220   220 40.0K  3.4%
page: The page number on which the image appears in the PDF file.
num: The image number, counted consecutively across the whole document (0 to 9 in the listing above).
type: The type of image, such as "image" or "smask" (soft mask).
width: The width of the image in pixels.
height: The height of the image in pixels.
color: The color space of the image, such as "index" (indexed color), "gray" (grayscale), "rgb" or "icc".
comp: The number of color components in the image.
bpc: The number of bits per color component.
enc: The encoding of the image data, such as "image" (plain raster data), "jpeg" or "jpx" (JPEG 2000).
interp: Whether the interpolation flag is set for the image ("yes" or "no").
object ID: The object number and generation of the image in the PDF file (for example "18 0").
x-ppi: The horizontal resolution of the image in pixels per inch (PPI).
y-ppi: The vertical resolution of the image in pixels per inch (PPI).
size: The size of the embedded image data in the PDF file.
ratio: The compression ratio of the embedded image.
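The listing above can be turned into structured records before deciding which images are worth passing to OCR. The following is a minimal sketch, assuming the output always has the 16 whitespace-separated columns shown here; the names parse_pdfimages_list, list_pdf_images, FIELDS and the "generation" key are made up for illustration and are not part of pdfimages.

```python
import subprocess

# Column names in the order printed by `pdfimages -list`.
# "object ID" is two tokens: the object number and its generation.
FIELDS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
          "enc", "interp", "object", "generation", "x_ppi", "y_ppi",
          "size", "ratio"]
INT_FIELDS = {"page", "num", "width", "height", "comp", "bpc",
              "object", "generation", "x_ppi", "y_ppi"}


def parse_pdfimages_list(text):
    """Parse the text printed by `pdfimages -list` into a list of dicts."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the header, the dashed separator and anything malformed:
        # data rows have 16 tokens and start with a numeric page number.
        if len(tokens) != len(FIELDS) or not tokens[0].isdigit():
            continue
        row = dict(zip(FIELDS, tokens))
        for name in INT_FIELDS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


def list_pdf_images(pdf_path):
    """Run `pdfimages -list` (must be installed) and parse its output."""
    result = subprocess.run(["pdfimages", "-list", pdf_path],
                            capture_output=True, text=True, check=True)
    return parse_pdfimages_list(result.stdout)


# Three rows copied from the listing above.
SAMPLE = """\
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 164 164 index 1 8 image no 18 0 197 197 647B 2.4%
1 1 smask 164 164 gray 1 8 image no 18 0 197 197 51B 0.2%
3 7 image 615 579 rgb 3 8 jpx yes 64 0 220 220 35.7K 3.4%
"""

rows = parse_pdfimages_list(SAMPLE)
# Soft masks are transparency data, not pictures, so drop them.
pictures = [r for r in rows if r["type"] == "image"]
```

Filtering on type drops the soft masks, and the page, width and height fields could further narrow the candidates (for example, ignoring tiny icons) before running OCR.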
For LaTeX OCR, refer to https://github.com/lukas-blecher/LaTeX-OCR
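A rough sketch of the last step, saving the recognized formulas as LaTeX code, could look like the following. It assumes the images have already been written to disk and that the pix2tex package from the LaTeX-OCR repository is installed; the helper names pix2tex_recognizer and save_formulas are hypothetical. The recognizer is passed in as a plain function, so the saving logic can be tried without the OCR model.

```python
from pathlib import Path
from typing import Callable, Iterable


def pix2tex_recognizer() -> Callable[[str], str]:
    """Build an image-path -> LaTeX function backed by LaTeX-OCR.

    Assumes `pip install pix2tex`; imported lazily so the rest of this
    sketch works without the package.
    """
    from PIL import Image
    from pix2tex.cli import LatexOCR

    model = LatexOCR()  # loads the model weights (downloaded on first use)
    return lambda image_path: model(Image.open(image_path))


def save_formulas(image_paths: Iterable[str],
                  recognize: Callable[[str], str],
                  out_path: str) -> int:
    """Write one recognized formula per line; return how many were written."""
    lines = [recognize(str(path)).strip() for path in image_paths]
    text = "\n".join(lines) + ("\n" if lines else "")
    Path(out_path).write_text(text, encoding="utf-8")
    return len(lines)
```

With pix2tex installed, a call such as save_formulas(sorted(Path("out").glob("*.png")), pix2tex_recognizer(), "formulas.tex") would write one formula per line. Note that pdfimages only sees embedded raster images, so formulas typeset as text or vector graphics in the PDF would not appear in its listing and would need the page rendered to an image first.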