PHPWord

How to Read Doc or DocX

Open lucaswhob opened this issue 4 years ago • 10 comments

Hello, and thank you for the best Word-processing library. I need to read a DOCX file and extract:
1- Text
2- All images
3- All links, with their titles
Please guide me on reading a DOCX file. I have read the documentation and all your examples, but I cannot find an example that reads elements and sections. Thanks.

lucaswhob avatar Jun 24 '21 07:06 lucaswhob

@lucaswhob something like this?

$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load('my/file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';
foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        if ($element instanceof \PhpOffice\PhpWord\Element\Text) {
            $text .= $element->getText();
        }
        // and so on for other element types (see src/PhpWord/Element)
    }
}

gisostallenberg avatar Jun 29 '21 09:06 gisostallenberg

@gisostallenberg Got no output on echo $text;, and no error either.

The reader example for DOCX files at https://github.com/PHPOffice/PHPWord/blob/develop/samples/Sample_11_ReadWord2007.php has no useful information about how to actually read a Word 2007 file.

nikunjbhatt avatar Aug 15 '21 15:08 nikunjbhatt

@nikunjbhatt

This was just a simple example. Sections seem to also contain TextRuns (these are containers), which contain sub-elements. Something like this should work:

<?php

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Text;
use PhpOffice\PhpWord\IOFactory as WordIOFactory;

require_once __DIR__.'/vendor/autoload.php';

$objReader = WordIOFactory::createReader('Word2007');
$phpWord = $objReader->load('file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';

function getWordText($element) {
    $result = '';
    if ($element instanceof AbstractContainer) {
        foreach ($element->getElements() as $child) {
            $result .= getWordText($child);
        }
    } elseif ($element instanceof Text) {
        $result .= $element->getText();
    }
    // and so on for other element types (see src/PhpWord/Element)

    return $result;
}

foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        $text .= getWordText($element);
    }
}

echo $text;
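
The original question also asked for links (with titles) and images. The same recursive walk can probably be extended for those. This is a minimal sketch, not a tested recipe: `extractLinksAndImages()` and `collectLinksAndImages()` are hypothetical helper names, and it assumes the reader produces `Link` and `Image` elements for the document in question, and that Composer's autoloader is already loaded as in the script above.

```php
<?php

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Image;
use PhpOffice\PhpWord\Element\Link;
use PhpOffice\PhpWord\PhpWord;

// Hypothetical helper (not part of PHPWord): walks one element and its
// children, appending any links and images it finds to the two arrays.
function collectLinksAndImages($element, array &$links, array &$images): void
{
    if ($element instanceof Link) {
        // getText() is the visible title, getSource() the target URL.
        $links[] = ['title' => $element->getText(), 'url' => $element->getSource()];
    } elseif ($element instanceof Image) {
        $images[] = $element;
    } elseif ($element instanceof AbstractContainer) {
        foreach ($element->getElements() as $child) {
            collectLinksAndImages($child, $links, $images);
        }
    }
}

// Hypothetical helper: returns ['links' => [...], 'images' => [...]].
function extractLinksAndImages(PhpWord $phpWord): array
{
    $links = [];
    $images = [];
    foreach ($phpWord->getSections() as $section) {
        // A Section is itself a container, so it can be walked directly.
        collectLinksAndImages($section, $links, $images);
    }

    return ['links' => $links, 'images' => $images];
}
```

Each collected `Image` could then be written to disk, for example with `file_put_contents($name . '.' . $image->getImageExtension(), base64_decode($image->getImageStringData(true)))`. Like the walk in the script above, this only descends into containers, so links or images inside tables would be missed.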

gisostallenberg avatar Aug 18 '21 09:08 gisostallenberg

Might I suggest a small improvement to the recursive method, since it can miss text from several object types:

    // I would assume this is being run in the context of a class that imports
    // PhpOffice\PhpWord\IOFactory and PhpOffice\PhpWord\Element\AbstractContainer
    
    public function getDocumentText(string $filepath): string
    {
        $document = IOFactory::createReader('Word2007')
            ->load($filepath);
        $documentText = '';

        foreach ($document->getSections() as $section) {
            foreach ($section->getElements() as $element) {
                $text = $this->getElementText($element);
                
                if (strlen($text)) {
                    // This ensures that the text from one element doesn't stickRightToTheNextElementLikeThis
                    $documentText .= $text . "\n";
                }
            }
        }

        return $documentText;
    }
    
    protected function getElementText($element): string
    {
        $result = '';

        if ($element instanceof AbstractContainer) {
            foreach ($element->getElements() as $subElement) {
                $result .= $this->getElementText($subElement);
            }
        }

        if (method_exists($element, 'getText')) {
            $result .= $element->getText();
        }

        return $result;
    }
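
One gap that may remain: as far as I can tell, `Table` is not an `AbstractContainer`, so text inside table cells is skipped by the walk above. A possible extension is sketched below as a standalone function. `getElementTextWithTables()` is a hypothetical name, the tab and newline separators are arbitrary choices, and the autoloader is assumed to be loaded already.

```php
<?php

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Table;

// Hypothetical helper (not part of PHPWord): like getElementText(), but also
// descends into tables via getRows() and getCells().
function getElementTextWithTables($element): string
{
    $result = '';

    if ($element instanceof Table) {
        foreach ($element->getRows() as $row) {
            $cells = [];
            foreach ($row->getCells() as $cell) {
                // A Cell is a container, so the recursion handles its content.
                $cells[] = getElementTextWithTables($cell);
            }
            // One line per row, cells separated by tabs.
            $result .= implode("\t", $cells) . "\n";
        }
    } elseif ($element instanceof AbstractContainer) {
        foreach ($element->getElements() as $child) {
            $result .= getElementTextWithTables($child);
        }
    } elseif (method_exists($element, 'getText')) {
        $text = $element->getText();
        if (is_string($text)) {
            $result .= $text;
        } elseif (is_object($text)) {
            // Some elements (e.g. Title) may return a TextRun instead of a string.
            $result .= getElementTextWithTables($text);
        }
    }

    return $result;
}
```

The `elseif` chain is deliberate: a container that also happens to expose `getText()` is then counted once rather than twice. Other return types from `getText()` (such as arrays) are ignored in this sketch.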

peter-at-bpt avatar Aug 31 '21 20:08 peter-at-bpt

Sorry for hijacking the topic, but I have a related question. I am also walking the document object tree with a recursive implementation. I am trying to extract a "table of contents", so I am looking for PhpOffice\PhpWord\Element\Title objects. Unfortunately, even though the document seems to be formatted properly, the object model does not give me any such objects. I only see PhpOffice\PhpWord\Element\TextRun|Text|Break|Image|...

The XML looks like this

<w:p xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordml" w:rsidP="02051CF4" w14:paraId="4E47C1E7" wp14:textId="5ECEFD8F">
  <w:pPr>
    <w:pStyle w:val="Title"/>
    <w:rPr>
      <w:rFonts w:ascii="Calibri Light" w:hAnsi="Calibri Light" w:eastAsia="" w:cs=""/>
      <w:sz w:val="56"/>
      <w:szCs w:val="56"/>
    </w:rPr>
  </w:pPr>
  <w:bookmarkStart w:name="_GoBack" w:id="0"/>
  <w:bookmarkEnd w:id="0"/>
  <w:r w:rsidR="7A933B85">
    <w:rPr/>
    <w:t xml:space="preserve">The </w:t>
  </w:r>
  <w:proofErr w:type="spellStart"/>
  <w:r w:rsidR="7A933B85">
    <w:rPr/>
    <w:t>document</w:t>
  </w:r>
  <w:proofErr w:type="spellEnd"/>
  <w:r w:rsidR="7A933B85">
    <w:rPr/>
    <w:t xml:space="preserve"> title</w:t>
  </w:r>
</w:p>

Do you have any suggestions?
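
One possible workaround, if the reader does not turn such paragraphs into Title objects, is to look at the paragraph style name that the elements carry. This is only a sketch under assumptions: it assumes the reader preserves the `w:pStyle` value (here `Title`) as the style name of the resulting TextRun, which may depend on the PHPWord version. `paragraphStyleName()` and `findParagraphsByStyle()` are hypothetical helper names, and the autoloader is assumed to be loaded already.

```php
<?php

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Style\Paragraph;

// Hypothetical helper: returns the paragraph style name of an element,
// or null if the element has none.
function paragraphStyleName($element): ?string
{
    if (!method_exists($element, 'getParagraphStyle')) {
        return null;
    }
    $style = $element->getParagraphStyle();
    if ($style instanceof Paragraph) {
        return $style->getStyleName();
    }

    // The style may also be stored as a plain style-name string.
    return is_string($style) ? $style : null;
}

// Hypothetical helper: returns the direct children of a container whose
// paragraph style name is one of $styleNames.
function findParagraphsByStyle(AbstractContainer $container, array $styleNames): array
{
    $matches = [];
    foreach ($container->getElements() as $element) {
        $name = paragraphStyleName($element);
        if ($name !== null && in_array($name, $styleNames, true)) {
            $matches[] = ['style' => $name, 'element' => $element];
        }
    }

    return $matches;
}
```

Calling `findParagraphsByStyle($section, ['Title', 'Heading1', 'Heading2'])` for each section would then give candidate entries for a table of contents; the style names in that list are examples and would need to match the ones used in the document.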

osnard avatar Nov 16 '21 16:11 osnard

Thank you for showing how to extract the content of a docx file. Could you please also show me how to extract the content of a doc file?
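
For what it's worth, PHPWord also ships an `MsDoc` reader for the older binary format, and `IOFactory::load()` accepts a reader name as its second argument. A minimal sketch of picking the reader from the file extension follows. `readerNameFor()` and `loadDocument()` are hypothetical helper names, the extension map is an assumption, and the `MsDoc` reader may not handle every .doc file as well as the `Word2007` reader handles .docx. The autoloader is assumed to be loaded already.

```php
<?php

use PhpOffice\PhpWord\IOFactory;
use PhpOffice\PhpWord\PhpWord;

// Hypothetical helper: maps a file extension to a PHPWord reader name.
function readerNameFor(string $path): string
{
    $readers = [
        'docx' => 'Word2007',
        'doc'  => 'MsDoc',
        'odt'  => 'ODText',
        'rtf'  => 'RTF',
        'html' => 'HTML',
    ];
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!isset($readers[$extension])) {
        throw new InvalidArgumentException("No reader known for .$extension files");
    }

    return $readers[$extension];
}

// Hypothetical helper: loads a document with the reader matching its extension.
function loadDocument(string $path): PhpWord
{
    return IOFactory::load($path, readerNameFor($path));
}
```

The returned object can then be passed through the same section and element loop as in the docx examples, e.g. `$phpWord = loadDocument('file.doc');`.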

richardsonoge avatar Jan 04 '23 15:01 richardsonoge

Yes, I am also trying to convert a .doc file to text. The examples given convert a .docx file to text. How can we convert a .doc file to text?

mrtsglk avatar Jan 24 '24 12:01 mrtsglk

A method like Reader::getContentAsPlainText() would be very useful!

gravitiq-cm avatar Apr 06 '24 13:04 gravitiq-cm

"A method like Reader::getContentAsPlainText() would be very useful!" Can you show me how to use such a method with my code?

richardsonoge avatar Apr 06 '24 14:04 richardsonoge

"Yes, I am trying to convert the doc extension file to text. In the examples given, we can convert the docx file to text. How can we convert a doc extension file to text?"

But how can I do it with my code? Or can you give me some code to do it?

richardsonoge avatar Apr 06 '24 19:04 richardsonoge