zend-feed
Consider using DOMDocument recovery mode
See Stack Overflow for details: https://stackoverflow.com/a/9281963/893222
The idea is to handle malformed XML using libxml's recovery option, which DOMDocument exposes to userland:
$dom = new DOMDocument();
$dom->recover = TRUE;
The problem is that false results could lead to subsequent errors in parsing and handling of the entire feed.
Maybe one option is to inject your own DOMDocument,
or to use a decorator that cleans up / recovers the feeds before they are parsed by the reader.
You can use the recovery mode yourself:
// Import by URI
$httpClient = Zend\Feed\Reader\Reader::getHttpClient();
$response = $httpClient->get(
'https://github.com/zendframework/zend-feed/releases.atom'
);
$xmlString = $response->getBody();
// Create DOMDocument
$dom = new DOMDocument;
$dom->recover = true;
$dom->loadXML(trim($xmlString));
// Detect type
$type = Zend\Feed\Reader\Reader::detectType($dom);
// Create reader
if (0 === strpos($type, 'rss')) {
    $reader = new Zend\Feed\Reader\Feed\Rss($dom, $type);
}
if (0 === strpos($type, 'atom')) {
    $reader = new Zend\Feed\Reader\Feed\Atom($dom, $type);
}
var_dump($reader->getTitle()); // "Release notes from zend-feed"
Thanks for help! This is indeed what I ended up doing: https://gitlab.com/DeepRSS/Reader/blob/3667b1b10c11b9c067de1e3242f15eaf2a1de261/src/Core/Service/ZendReader/FeedParser.php#L35
@Isinlor Thanks for the fast response! 👍
Can you provide a link to a feed which is malformed and needs the recovery mode?
Here is one example: http://itbrokeand.ifixit.com/atom.xml
Code I used for testing:
<?php
$libxmlErrflag = libxml_use_internal_errors(true);
$oldValue = libxml_disable_entity_loader(true);
$dom = new \DOMDocument;
//$dom->recover = true; // Allows parsing of slightly malformed feeds
$status = $dom->loadXML(file_get_contents("http://itbrokeand.ifixit.com/atom.xml"));
if (!$status) {
    // Build error message
    $error = libxml_get_last_error();
    if ($error instanceof \LibXMLError && $error->message != '') {
        $error->message = trim($error->message);
        $errormsg = "DOMDocument cannot parse XML: {$error->message}";
    } else {
        $errormsg = "DOMDocument cannot parse XML: Please check the XML document's validity";
    }
    throw new Exception($errormsg);
}
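A variation on the script above, as a sketch only: once `recover` is enabled, the return value of `loadXML()` alone says little about the feed, so it can help to collect everything libxml reported and decide afterwards whether the recovered document is trustworthy. The function name `loadXmlWithRecovery` and the inline sample feed are assumptions for illustration; they are not part of zend-feed.

```php
<?php
/**
 * Hypothetical helper (not part of zend-feed): parse XML in recovery mode
 * and return the document together with the problems libxml reported.
 *
 * @return array{0: \DOMDocument|null, 1: string[]}
 */
function loadXmlWithRecovery(string $xml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new \DOMDocument();
    $dom->recover = true; // Ask libxml to keep going after well-formedness errors
    $loaded = $dom->loadXML($xml);

    // Turn every reported error into a short, readable message.
    $problems = [];
    foreach (libxml_get_errors() as $error) {
        $problems[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous libxml error handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$loaded ? $dom : null, $problems];
}

// Sample input with an unescaped "&", similar to the ifixit feed.
[$dom, $problems] = loadXmlWithRecovery(
    '<feed><title>Web Operations D&D</title></feed>'
);

foreach ($problems as $problem) {
    echo $problem, PHP_EOL;
}
```

If `$problems` is non-empty, the document was repaired rather than parsed cleanly, and values read from it afterwards may be truncated or altered.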
@Isinlor Perfect, this helps a lot. I collect various problems to create some test scenarios.
I think your initial reaction was correct.
The problem is that false results could lead to subsequent errors in parsing and handling of the entire feed.
I missed it when I was working on it myself. But indeed, even though $dom->recover = true; seems to work, Zend Feed is not able to handle it correctly.
I'm really curious how Firefox handles it, because I have no issues if I open:
- https://blog.noredink.com/rss
- http://itbrokeand.ifixit.com/atom.xml
- http://aasnova.org/feed/
- https://blog.floydhub.com/rss/
@Isinlor I will check all links this evening and give feedback.
@Isinlor
https://blog.noredink.com/rss
There were some problems earlier, but I cannot find any now.
http://itbrokeand.ifixit.com/atom.xml
The problem is <title>Web Operations D&D</title>: the unescaped & makes the document not well-formed. This should be reported to ifixit.com. Any other fix means ugly replacements.
(Also fails in a browser.)
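For illustration, one such "ugly replacement" could look like the sketch below. The function name `escapeBareAmpersands` is hypothetical. It escapes every & that does not already start a named, decimal, or hexadecimal reference. It is a blunt tool: it also rewrites ampersands inside CDATA sections, where a literal & is legal, and it leaves undefined entities such as `&foo;` untouched.

```php
<?php
/**
 * Hypothetical helper: escape "&" characters that do not begin an
 * entity or character reference.
 */
function escapeBareAmpersands(string $xml): string
{
    // Negative lookahead: skip "&name;", "&#123;" and "&#x1F;" style references.
    $result = preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );

    // preg_replace() returns null on failure; fall back to the input then.
    return $result ?? $xml;
}

$broken = '<title>Web Operations D&D</title>';
$fixed  = escapeBareAmpersands($broken);

$dom = new \DOMDocument();
$dom->loadXML($fixed);

echo $dom->documentElement->textContent, PHP_EOL; // prints "Web Operations D&D"
```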
http://aasnova.org/feed/
Two problems: 403 and wrong header.
(Also fails in a browser. [Download])
https://blog.floydhub.com/rss/
Many feeds contain characters out of the legal range.
Try the following preg_replace:
$string = preg_replace(
    '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
    ' ',
    $string
);
This should eliminate problems like "CData section not finished".
(Also fails in a browser.)
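As a sketch of how that replacement could be applied before parsing: the wrapper name `stripIllegalXmlChars` and the sample string are assumptions for illustration. Each run of characters outside the XML 1.0 legal range is replaced by a single space. Because of the /u modifier, preg_replace() returns null when the input is not valid UTF-8, so the sketch falls back to the unmodified string in that case.

```php
<?php
/**
 * Hypothetical wrapper around the preg_replace() call from this thread:
 * replace runs of characters that XML 1.0 does not allow with one space.
 */
function stripIllegalXmlChars(string $xml): string
{
    $result = preg_replace(
        '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        ' ',
        $xml
    );

    // null means the pattern could not be applied (e.g. invalid UTF-8 input).
    return $result ?? $xml;
}

// Sample feed fragment containing a backspace (0x08), which is illegal in XML 1.0.
$raw   = "<rss><channel><title>Bad\x08char</title></channel></rss>";
$clean = stripIllegalXmlChars($raw);

$dom = new \DOMDocument();
$dom->loadXML($clean);

echo $dom->getElementsByTagName('title')->item(0)->textContent, PHP_EOL; // prints "Bad char"
```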
Thanks for the examples. At the moment I do not know whether we should do something in zend-feed, because it opens the door to many pitfalls and ugly workarounds. I see the benefit for the user, but also the maintenance burden.
I remain open to suggestions and improvements.
This repository has been closed and moved to laminas/laminas-feed; a new issue has been opened at https://github.com/laminas/laminas-feed/issues/8.