
How to get images and text in order as in PDF?

Open salmanulfaris opened this issue 4 months ago • 5 comments

  • PDFParser Version: 2.9

Description:

I want to extract the PDF, then save the text to a DB and the images to storage. The order matters: if I take page 1 and encounter an image, I need to get the text that comes after it.

PDF input

A PDF containing some text followed by images on each page.

Expected output & actual output

I need to extract the images and text in the same order as they appear in the PDF. How can I do that?

Code

This is the code I'm using to extract the images, but the text is not available here:

$parser = new Parser();
$pdf = $parser->parseFile(public_path('paper.pdf'));
$objects = $pdf->getObjects();
foreach ($objects as $key => $object) {
    echo '<img src="data:image/jpeg;base64,' . base64_encode($object->getContent()) . '" />';
}

salmanulfaris, Apr 29 '24 09:04

Without further investigation I don't think that is possible.

k00ni, May 03 '24 07:05

You can use it as below:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('./test.pdf');
$objects = $pdf->getObjects();
$html = "<html><body>";


foreach ($objects as $key => $object) {
    if ($object instanceof \Smalot\PdfParser\XObject\Image) {
        $image = $object->getContent();
        $html .= "<img src='data:image/jpeg;base64," . base64_encode($image) . "' />";
    } else {
        $text = $object->getText();
        $html .= "<div>{$text}</div>";
    }
}
$html .= "</body></html>";
file_put_contents('./test_to_html.html', $html);

azwhale, May 13 '24 03:05

Careful here. There are objects of other types as well, so your else-part is likely to run into an error. Also, Document::getObjects might not return an ordered list. You shouldn't rely on PDFParser adding objects in the same order as they appear in the PDF.

Instead, you could iterate over all pages ($pdf->getPages()) and see if you can get images and texts from them (check Page::getText and Page::getXObjects). Might be worth a try.
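
A minimal sketch of that per-page approach, assuming Page::getText and Page::getXObjects behave as named in this thread. The function name collectPageContent and the result keys are hypothetical, chosen only for illustration. Note that this keeps page order only: within a page, the images are taken from the page's resources, so their position relative to the text is not recovered.

```php
<?php

use Smalot\PdfParser\XObject\Image;

/**
 * Hypothetical helper: groups text and embedded images per page.
 *
 * $pages is assumed to be the result of Document::getPages().
 * Pages are returned in the order given; text and images inside
 * one page are grouped, not interleaved.
 */
function collectPageContent(iterable $pages): array
{
    $result = [];
    $pageNumber = 0;

    foreach ($pages as $page) {
        $pageNumber++;
        $images = [];

        foreach ($page->getXObjects() as $name => $xObject) {
            // XObjects can be of other types too, so keep images only.
            if ($xObject instanceof Image) {
                $images[$name] = $xObject->getContent();
            }
        }

        $result[] = [
            'page'   => $pageNumber,
            'text'   => $page->getText(),
            'images' => $images,
        ];
    }

    return $result;
}
```

With the library loaded through Composer's autoloader, this could be called as collectPageContent($pdf->getPages()) on the $pdf object from the snippets above. Each entry then holds one page's text and its raw image data, which can be written to the DB and to storage separately.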

k00ni, May 13 '24 06:05

We can handle those errors, but the order of the objects is very important for me. I'm scraping a PDF that is the answer key of an exam, and I want to fetch the questions and answers from it and store them in a DB. Questions and options may be either text or images, so I need to identify each question and its answers from the sequence of objects.

I'm attaching a sample document here: Example Document.pdf

salmanulfaris, May 13 '24 08:05