zend-feed

Consider using DOMDocument recovery mode

Open Isinlor opened this issue 6 years ago • 10 comments

See Stack Overflow for details: https://stackoverflow.com/a/9281963/893222

The idea is to handle malformed XML using the recovery option in libxml, which is exposed in userland through DOMDocument:

$dom = new DOMDocument();
$dom->recover = TRUE;
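
To make the difference visible, here is a minimal sketch that parses the same string once strictly and once in recovery mode, collecting the libxml errors instead of emitting warnings. The sample string is a hypothetical input made up for illustration; what the recovered tree contains depends on libxml and should be inspected rather than assumed.

```php
<?php
// Hypothetical sample input: the bare "&" makes it not well-formed.
$xml = '<feed><title>Web Operations D&D</title></feed>';

// Collect libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

// Strict parse: loadXML() returns false for input that is not well-formed.
$strict       = new DOMDocument();
$strictOk     = $strict->loadXML($xml);
$strictErrors = libxml_get_errors();
libxml_clear_errors();

// Recovery parse: libxml tries to build a tree anyway.
$lenient          = new DOMDocument();
$lenient->recover = true;
$lenientOk        = $lenient->loadXML($xml);
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

// Show what the strict parser complained about.
foreach ($strictErrors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

var_dump($strictOk, $lenientOk);
```

The recovered document may silently drop or alter content near the error, so the result should be treated as a best effort rather than as the original feed.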

Isinlor avatar May 07 '18 21:05 Isinlor

The problem is that false results could lead to subsequent errors in parsing and handling of the entire feed.

One option might be to inject your own DOMDocument, or to use a decorator that cleans up or recovers the feeds before they are parsed by the reader.

froschdesign avatar Oct 05 '18 06:10 froschdesign

You can use the recovery mode yourself:

// Import by URI
$httpClient = Zend\Feed\Reader\Reader::getHttpClient();
$response   = $httpClient->get(
    'https://github.com/zendframework/zend-feed/releases.atom'
);
$xmlString  = $response->getBody();

// Create DOMDocument
$dom          = new DOMDocument;
$dom->recover = true;
$dom->loadXML(trim($xmlString));

// Detect type
$type = Zend\Feed\Reader\Reader::detectType($dom);

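// Create reader
if (0 === strpos($type, 'rss')) {
    $reader = new Zend\Feed\Reader\Feed\Rss($dom, $type);
} elseif (0 === strpos($type, 'atom')) {
    $reader = new Zend\Feed\Reader\Feed\Atom($dom, $type);
} else {
    // Avoid using an undefined $reader below
    throw new RuntimeException("Unsupported feed type: {$type}");
}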

var_dump($reader->getTitle()); // "Release notes from zend-feed"

froschdesign avatar Mar 04 '19 21:03 froschdesign

Thanks for the help! This is indeed what I ended up doing: https://gitlab.com/DeepRSS/Reader/blob/3667b1b10c11b9c067de1e3242f15eaf2a1de261/src/Core/Service/ZendReader/FeedParser.php#L35

Isinlor avatar Mar 04 '19 23:03 Isinlor

@Isinlor Thanks for the fast response! 👍

Can you provide a link to a feed which is malformed and needs the recovery mode?

froschdesign avatar Mar 05 '19 06:03 froschdesign

Here is one example: http://itbrokeand.ifixit.com/atom.xml

Code I used for testing:

<?php

$libxmlErrflag = libxml_use_internal_errors(true);
$oldValue = libxml_disable_entity_loader(true);

$dom = new \DOMDocument;
//$dom->recover = true; // Uncomment to parse slightly malformed feeds

$status = $dom->loadXML(file_get_contents("http://itbrokeand.ifixit.com/atom.xml"));

if (!$status) {

    // Build error message
    $error = libxml_get_last_error();
    if ($error instanceof \LibXMLError && $error->message != '') {
        $error->message = trim($error->message);
        $errormsg = "DOMDocument cannot parse XML: {$error->message}";
    } else {
        $errormsg = "DOMDocument cannot parse XML: Please check the XML document's validity";
    }

    throw new Exception($errormsg);
}

Isinlor avatar Mar 05 '19 11:03 Isinlor

@Isinlor Perfect, this helps a lot. I am collecting various problem cases to create some test scenarios.

froschdesign avatar Mar 05 '19 12:03 froschdesign

I think your initial reaction was correct.

The problem is that false results could lead to subsequent errors in parsing and handling of the entire feed.

I missed it when I was working on it myself. But indeed, even though $dom->recover = true; seems to work, Zend Feed is not able to handle the result correctly.

I'm really curious how Firefox handles it, because I have no issues when I open:

  • https://blog.noredink.com/rss
  • http://itbrokeand.ifixit.com/atom.xml
  • http://aasnova.org/feed/
  • https://blog.floydhub.com/rss/

Isinlor avatar Mar 05 '19 12:03 Isinlor

@Isinlor I will check all links this evening and give you feedback.

froschdesign avatar Mar 05 '19 13:03 froschdesign

@Isinlor

https://blog.noredink.com/rss

There were some problems earlier, but I cannot find any now.

http://itbrokeand.ifixit.com/atom.xml

The problem is <title>Web Operations D&D</title>: the unescaped & (it should be &amp;) makes the feed not well-formed. This should be reported to ifixit.com. Any other fix means ugly replacements.

(Also fails in a browser.)
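
For illustration, this is roughly what such an "ugly replacement" could look like: a sketch that escapes every & that does not already start an entity or character reference. The helper name escapeBareAmpersands is hypothetical and not part of zend-feed. Note that it also rewrites ampersands inside CDATA sections and comments, where a bare & is legal, which is exactly the kind of pitfall mentioned above.

```php
<?php
// Hypothetical helper, not part of zend-feed: escape each "&" that is not
// followed by a named entity (&name;) or a character reference (&#38; / &#x26;).
function escapeBareAmpersands(string $xml): string
{
    $escaped = preg_replace(
        '/&(?!(?:[A-Za-z_:][\w:.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );

    // preg_replace() returns null on failure; fall back to the input.
    return $escaped ?? $xml;
}

$broken = '<title>Web Operations D&D</title>';
$fixed  = escapeBareAmpersands($broken);

// The repaired string now parses without recovery mode.
$dom = new DOMDocument();
$dom->loadXML($fixed);

echo $fixed, "\n";                              // <title>Web Operations D&amp;D</title>
echo $dom->documentElement->textContent, "\n";  // Web Operations D&D
```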

http://aasnova.org/feed/

Two problems: a 403 response and a wrong header.

(Also fails in a browser. [Download])

https://blog.floydhub.com/rss/

Many feeds contain characters outside the legal XML character range.

Try the following preg_replace:

// Replace each run of characters that are illegal in XML 1.0 with a space
$string = preg_replace(
    '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
    ' ',
    $string
);

This should eliminate problems like "CData section not finished".

(Also fails in a browser.)
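
A minimal sketch of how the preg_replace above could be wrapped as a cleanup step that runs before the string reaches DOMDocument. The function name stripIllegalXmlChars and the sample input are hypothetical. With the /u modifier, preg_replace returns null when the input is not valid UTF-8, so that case is handled explicitly here.

```php
<?php
// Hypothetical helper, not part of zend-feed: replace each run of characters
// outside the XML 1.0 legal range with a single space.
function stripIllegalXmlChars(string $xml): string
{
    $clean = preg_replace(
        '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        ' ',
        $xml
    );

    // With /u, invalid UTF-8 input makes preg_replace() return null.
    if ($clean === null) {
        throw new InvalidArgumentException(
            'Could not clean input, PCRE error code ' . preg_last_error()
        );
    }

    return $clean;
}

// Hypothetical sample: a vertical tab (0x0B) is not a legal XML 1.0 character.
$raw   = "<title>a\x0Bb</title>";
$clean = stripIllegalXmlChars($raw);

$previous = libxml_use_internal_errors(true);
$dom      = new DOMDocument();
$loaded   = $dom->loadXML($clean); // strict parse of the cleaned string
libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($clean, $loaded);
```

Replacing with a space rather than an empty string keeps neighbouring words separated, at the cost of changing the text slightly.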


Thanks for the examples. At the moment I do not know whether we should do something in zend-feed, because it opens the door to many pitfalls and ugly workarounds. I see the benefit for users, but also the maintenance burden.

I remain open to suggestions and improvements.

froschdesign avatar Mar 12 '19 22:03 froschdesign

This repository has been closed and moved to laminas/laminas-feed; a new issue has been opened at https://github.com/laminas/laminas-feed/issues/8.

weierophinney avatar Dec 31 '19 21:12 weierophinney