php-readability
Undefined array key 0 after array_filter
After the array_filter call on line 1481 in the function hasSingleTagInsideElement, the resulting array sometimes does not start at index 0:
ErrorException
Undefined array key 0
at vendor/j0k3r/php-readability/src/Readability.php:1485
1481▕ $children = array_filter($childNodes, fn ($childNode) => $childNode instanceof \DOMElement);
1482▕ //$children = array_values($children);
1483▕ // There should be exactly 1 element child with given tag
1484▕
➜ 1485▕ if (1 !== \count($children) || $children[0]->nodeName !== $tag) {
1486▕ return false;
1487▕ }
1488▕
1489▕ $a = array_filter(
To fix it, you have to add array_values to reset the array indexes:
private function hasSingleTagInsideElement(\DOMElement $node, string $tag): bool
{
    $childNodes = iterator_to_array($node->childNodes);
    $children = array_filter($childNodes, fn ($childNode) => $childNode instanceof \DOMElement);
    $children = array_values($children);

    // There should be exactly 1 element child with given tag
    if (1 !== \count($children) || $children[0]->nodeName !== $tag) {
        return false;
    }
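The key behaviour behind this report can be checked without any DOM code. The following is a minimal sketch, not the library's code: plain strings stand in for child nodes, and the variable names are made up for illustration. It shows that array_filter keeps the original keys, so the only surviving item can sit at key 1, and that array_values renumbers it to 0.

```php
<?php
// Hypothetical stand-ins for child nodes: a comment followed by one element.
$childNodes = ['comment', 'element'];

// array_filter() preserves keys, so the surviving item keeps key 1.
$children = array_filter($childNodes, fn ($childNode) => 'element' === $childNode);

$keysBefore = array_keys($children);   // [1]
$hasKeyZero = isset($children[0]);     // false: $children[0] would raise "Undefined array key 0"

// array_values() renumbers the array from 0.
$reindexed = array_values($children);
$keysAfter = array_keys($reindexed);   // [0]

var_dump($keysBefore, $hasKeyZero, $keysAfter);
```

An alternative that avoids depending on key 0 at all would be reading the first item with reset($children) or array_key_first($children), but array_values keeps the rest of the function unchanged.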
You might be right. If you can reproduce the bug with a given website & create a test, I'm happy to review the fix :)
Hi @j0k3r, yes, you can try with my website "https://agenciaweb.net"; that is where I tried it and it failed.
For example, it happens with all pages from golem.de:
$url = 'https://www.golem.de/news/anzeige-varta-aa-batterien-grosses-set-zum-kleinen-preis-bei-amazon-2501-192329.html';
$graby = new Graby();
$result = $graby->fetchContent($url);
Looking at the code, the undefined key can only happen if there is a single p element preceded by non-element nodes. Out of the node types:
- DOMElement: working as expected, will increase count($children)
- DOMAttr: cannot be placed outside of element, I don't think
- DOMText: foo will get folded into an extra paragraph, increasing count($children)
- DOMCharacterData: <![CDATA[foo]]> gets removed earlier for some reason
- DOMEntityReference: &foo; will be escaped and folded into an extra paragraph, increasing count($children)
- DOMEntity: <!ENTITY foo "bar"> will be stripped
- DOMProcessingInstruction: <?PITarget PIContent?> gets stripped earlier
- DOMComment: <!-- foo --> gets removed by tidy
- DOMDocument: cannot be part of document?
- DOMDocumentType: <!doctype html> gets stripped
- DOMDocumentFragment: cannot be part of document, attempting to add it will transplant children
- DOMNotation: can only appear in DTD?
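To see how a non-element sibling shifts the key, here is a small sketch that uses DOMDocument directly, so no tidy step or Readability preprocessing is involved. The markup and variable names are illustrative assumptions, and it requires ext-dom; it only mirrors the first two lines of hasSingleTagInsideElement.

```php
<?php
// Assumption: a comment precedes the only element child, and nothing strips it.
$doc = new DOMDocument();
$doc->loadHTML('<div><!-- foo --><p>This is the awesome content.</p></div>');
$div = $doc->getElementsByTagName('div')->item(0);

// Same two steps as in hasSingleTagInsideElement.
$childNodes = iterator_to_array($div->childNodes);
$children = array_filter($childNodes, fn ($childNode) => $childNode instanceof DOMElement);

$count = count($children);                    // 1: only the <p> is a DOMElement
$firstKey = array_key_first($children);       // 1: the comment occupied key 0
$hasKeyZero = array_key_exists(0, $children); // false

// After reindexing, index 0 is safe to read.
$firstTag = array_values($children)[0]->nodeName; // "p"

var_dump($count, $firstKey, $hasKeyZero, $firstTag);
```

With count($children) equal to 1, the first half of the condition on line 1485 is false, so PHP goes on to evaluate $children[0], which is where the warning comes from.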
The only way this could crash I can come up with was to disable tidy and use a comment:
// This would fail on “Undefined array key 0” without tidy.
public function testDivSingleP(): void
{
    $readability = $this->getReadability('<div><!-- foo --><p>' . str_repeat('This is the awesome content. ', 7) . '</p></div>', 'http://0.0.0.0');
    $res = $readability->init();

    $this->assertTrue($res);
    $this->assertInstanceOf(JSLikeHTMLElement::class, $readability->getContent());
    $this->assertInstanceOf(JSLikeHTMLElement::class, $readability->getTitle());
    $this->assertStringContainsString('<div readability=', $readability->getContent()->getInnerHtml());
    $this->assertEmpty($readability->getTitle()->getInnerHtml());
    $this->assertStringContainsString('This is the awesome content.', $readability->getContent()->getInnerHtml());
}
But then it just crashes elsewhere, so I am not sure #97 could fix it:
TypeError: Readability\Readability::getAncestors(): Argument #1 ($node) must be of type DOMElement, DOMComment given, called in /home/jtojnar/Projects/php-readability/src/Readability.php on line 1022
/home/jtojnar/Projects/php-readability/src/Readability.php:1444
/home/jtojnar/Projects/php-readability/src/Readability.php:1022
/home/jtojnar/Projects/php-readability/src/Readability.php:244
/home/jtojnar/Projects/php-readability/tests/ReadabilityTest.php:119