PHPWord
How to read a DOC or DOCX file
Hello, and thank you for a great word-processing library. I need to read a DOCX file and extract: 1- the text, 2- all images, 3- all links with their titles. I have read the documentation and all of your examples, but I cannot find an example of reading elements and sections. Could you please guide me on how to read a DOCX file? Thanks.
@lucaswhob something like this?
$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$phpWord = $objReader->load('my/file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';
foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        if ($element instanceof \PhpOffice\PhpWord\Element\Text) {
            $text .= $element->getText();
        }
        // and so on for other element types (see src/PhpWord/Element)
    }
}
@gisostallenberg
I got no output from echo $text; and no error either.
The DOCX reader sample at https://github.com/PHPOffice/PHPWord/blob/develop/samples/Sample_11_ReadWord2007.php has no useful information about how to actually read the contents of a Word 2007 file.
@nikunjbhatt
This was just a simple example. Sections also seem to contain TextRuns (these are containers), which in turn contain sub-elements. Something like this should work:
<?php
use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Text;
use PhpOffice\PhpWord\IOFactory as WordIOFactory;

require_once __DIR__.'/vendor/autoload.php';

$objReader = WordIOFactory::createReader('Word2007');
$phpWord = $objReader->load('file.docx'); // instance of \PhpOffice\PhpWord\PhpWord
$text = '';

// Recursively collect the text of an element and of everything nested inside it
function getWordText($element) {
    $result = '';
    if ($element instanceof AbstractContainer) {
        // Containers (e.g. TextRun) hold child elements; use a separate
        // loop variable so the $element parameter is not overwritten
        foreach ($element->getElements() as $subElement) {
            $result .= getWordText($subElement);
        }
    } elseif ($element instanceof Text) {
        $result .= $element->getText();
    }
    // and so on for other element types (see src/PhpWord/Element)
    return $result;
}

foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $element) {
        $text .= getWordText($element);
    }
}
echo $text;
Might I suggest a small improvement to the recursive method, since it can miss text from several element types:
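// I would assume this is being run in the context of a class,
// with these imports at the top of the file:
//   use PhpOffice\PhpWord\Element\AbstractContainer;
//   use PhpOffice\PhpWord\IOFactory;
public function getDocumentText(string $filepath): string
{
    $document = IOFactory::createReader('Word2007')
        ->load($filepath);
    $documentText = '';
    foreach ($document->getSections() as $section) {
        foreach ($section->getElements() as $element) {
            $text = $this->getElementText($element);
            if (strlen($text)) {
                // This ensures that the text from one element doesn't stickRightToTheNextElementLikeThis
                // (reuse $text instead of walking the element a second time)
                $documentText .= $text . "\n";
            }
        }
    }
    return $documentText;
}

protected function getElementText($element): string
{
    $result = '';
    if ($element instanceof AbstractContainer) {
        foreach ($element->getElements() as $subElement) {
            $result .= $this->getElementText($subElement);
        }
    }
    // Any element type that exposes getText() contributes its text.
    // Caveat: if a container class also exposes getText() (which may be the
    // case for TextRun in some PHPWord versions), its text would be counted
    // twice; changing this "if" to "elseif" avoids that.
    if (method_exists($element, 'getText')) {
        $result .= $element->getText();
    }
    return $result;
}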
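The original question also asked for images and links, which the text-only examples above skip. Here is a minimal sketch of how the same recursive walk might be extended, assuming the reader produces PhpOffice\PhpWord\Element\Link and PhpOffice\PhpWord\Element\Image objects for them. The function name collectLinksAndImages and the shape of the $found array are my own, not part of PHPWord, and the Composer autoloader is assumed to be loaded already.

```php
<?php
use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Image;
use PhpOffice\PhpWord\Element\Link;

/**
 * Hypothetical helper: walk an element tree and record every link and image.
 * $found is filled in place with 'links' and 'images' lists.
 */
function collectLinksAndImages($element, array &$found): void
{
    if ($element instanceof Link) {
        // A link carries its visible text (the "title") and its target
        $found['links'][] = [
            'title' => $element->getText(),
            'url'   => $element->getSource(),
        ];
    } elseif ($element instanceof Image) {
        // For a loaded DOCX the source typically points inside the archive
        $found['images'][] = [
            'source'    => $element->getSource(),
            'extension' => $element->getImageExtension(),
        ];
    } elseif ($element instanceof AbstractContainer) {
        // Sections, TextRuns, table cells etc. hold child elements
        foreach ($element->getElements() as $subElement) {
            collectLinksAndImages($subElement, $found);
        }
    }
}
```

To use it, start with $found = ['links' => [], 'images' => []]; and call collectLinksAndImages($section, $found) for each section returned by $phpWord->getSections(). Note that tables are not containers themselves, so content inside tables is only reached if you also loop over the table's rows and cells. Image appears to offer getImageStringData() as well if you need the raw bytes, but please check that against your PHPWord version.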
Sorry for hijacking the topic, but I have a related question. I am also walking the document object tree with a recursive implementation. I am trying to extract a "table of contents", so I am looking for PhpOffice\PhpWord\Element\Title objects. Unfortunately, even though the document seems to be formatted properly, the object model does not give me any such objects. I only see PhpOffice\PhpWord\Element\TextRun|Text|Break|Image|...
The XML looks like this:
<w:p xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordml" w:rsidP="02051CF4" w14:paraId="4E47C1E7" wp14:textId="5ECEFD8F">
    <w:pPr>
        <w:pStyle w:val="Title"/>
        <w:rPr>
            <w:rFonts w:ascii="Calibri Light" w:hAnsi="Calibri Light" w:eastAsia="" w:cs=""/>
            <w:sz w:val="56"/>
            <w:szCs w:val="56"/>
        </w:rPr>
    </w:pPr>
    <w:bookmarkStart w:name="_GoBack" w:id="0"/>
    <w:bookmarkEnd w:id="0"/>
    <w:r w:rsidR="7A933B85">
        <w:rPr/>
        <w:t xml:space="preserve">The </w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r w:rsidR="7A933B85">
        <w:rPr/>
        <w:t>document</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r w:rsidR="7A933B85">
        <w:rPr/>
        <w:t xml:space="preserve"> title</w:t>
    </w:r>
</w:p>
Do you have any suggestions?
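One possible workaround, offered as a sketch rather than a confirmed diagnosis: besides looking for Title objects, check the paragraph style name attached to each TextRun, since the w:pStyle value ("Title" in the XML above) may be preserved there even when no Title element is created. This assumes TextRun::getParagraphStyle() returns either a style name or a Paragraph style object whose getStyleName() reflects w:pStyle. The helper names plainText and collectHeadings are hypothetical, and the Composer autoloader is assumed to be loaded.

```php
<?php
use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Text;
use PhpOffice\PhpWord\Element\TextRun;
use PhpOffice\PhpWord\Element\Title;
use PhpOffice\PhpWord\Style\Paragraph;

/** Hypothetical helper: flatten a string or an element tree to plain text. */
function plainText($element): string
{
    if (is_string($element)) {
        return $element;
    }
    if ($element instanceof AbstractContainer) {
        $out = '';
        foreach ($element->getElements() as $child) {
            $out .= plainText($child);
        }
        return $out;
    }
    if ($element instanceof Text) {
        return (string) $element->getText();
    }
    return '';
}

/** Hypothetical helper: collect headings as ['style' => ..., 'text' => ...] rows. */
function collectHeadings($element, array &$headings): void
{
    if ($element instanceof Title) {
        // Title::getText() may be a plain string or a TextRun, so normalise it
        $headings[] = [
            'style' => 'depth ' . $element->getDepth(),
            'text'  => plainText($element->getText()),
        ];
        return;
    }
    if ($element instanceof TextRun) {
        // Fallback: treat a TextRun as a heading if its style is named like one
        $style = $element->getParagraphStyle();
        $name = $style instanceof Paragraph ? $style->getStyleName() : $style;
        if (is_string($name) && preg_match('/^(Title|Heading\s?\d+)$/i', $name)) {
            $headings[] = ['style' => $name, 'text' => plainText($element)];
            return;
        }
    }
    if ($element instanceof AbstractContainer) {
        foreach ($element->getElements() as $child) {
            collectHeadings($child, $headings);
        }
    }
}
```

Call collectHeadings($section, $headings) for each section with $headings initialised to an empty array. If the style name turns out to be empty for your documents, this fallback will not help, and the pattern may need adjusting for localised or custom style names.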
Thank you for showing how to get the content of a .docx file. Could you also show me how to get the content of a .doc file, please?
Yes, I am trying to convert a .doc file to text. The examples given convert a .docx file to text. How can we do the same for a .doc file?
A method like Reader::getContentAsPlainText() would be very useful!
"A method like Reader::getContentAsPlainText() would be very useful!"
Can you show me how to use the method you gave with my code?
But how can I do it with my code? Or can you give me code that does it?
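For .doc files, PHPWord ships a separate reader named 'MsDoc' for the legacy binary format, so one option is to pick the reader from the file extension and then reuse the same text-extraction walk as for .docx. A minimal sketch follows. The helper names readerNameForFile and loadDocument are hypothetical, the Composer autoloader is assumed to be loaded, and I would not assume the MsDoc reader recovers everything from every .doc file.

```php
<?php
use PhpOffice\PhpWord\IOFactory;

/**
 * Hypothetical helper: map a file extension to a PHPWord reader name.
 * Returns null for extensions this sketch does not know about.
 */
function readerNameForFile(string $path): ?string
{
    $readers = [
        'docx' => 'Word2007',
        'doc'  => 'MsDoc',
        'odt'  => 'ODText',
        'rtf'  => 'RTF',
        'html' => 'HTML',
    ];
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return $readers[$extension] ?? null;
}

/** Hypothetical helper: load a document with the reader matching its extension. */
function loadDocument(string $path)
{
    $readerName = readerNameForFile($path);
    if ($readerName === null) {
        throw new InvalidArgumentException("No known reader for: $path");
    }

    // IOFactory::load() takes the reader name as its second argument
    return IOFactory::load($path, $readerName);
}
```

With this in place, $phpWord = loadDocument('file.doc'); gives you a PhpWord object, and the getSections() loop from the earlier replies can be run on it unchanged. If the result is empty or incomplete, converting the .doc to .docx first (for example with a word processor) may be the more reliable route.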