php-readability
Keep URLs in documents
Hi, I have tried my best to adapt the given code so that URLs stay in the document instead of being moved into footnotes or removed. Can anybody point me in the right direction?
Thank you :)
Could you share a piece of code and what you want it to achieve?
Well, it is Readability.php.
I think the function is clean().
Now, if I have this content in an HTML page:
<p>This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" /></p>.
I want the function to return
This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" />.
Instead, now it returns this:
This is my HTML code and this is a link and this is an image.
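The stripped result above is what reading a DOM node's `textContent` produces. As a minimal sketch using only PHP's bundled DOM extension (not php-readability itself, and not necessarily what the library does internally), the example paragraph can be parsed and read back as plain text like this:

```php
<?php
// The example paragraph from this thread.
$html = '<p>This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" /></p>';

$doc = new DOMDocument();
// The flags keep libxml from wrapping the fragment in <html><body> and a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$p = $doc->getElementsByTagName('p')->item(0);

// textContent concatenates the text nodes only, so <a> and <img> markup is dropped.
$plain = $p->textContent;
echo $plain, "\n";
```

If the calling script prints `textContent`, the links and images would disappear at that step regardless of what the extraction kept.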
URLs and images shouldn't be removed. Could you share a real example so I can try to reproduce what you describe?
For example, this URL: http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php contains some h3 and a elements. The main goal is to keep them (as well as the formatting) and return an HTML document rather than plain text.
Using that example, I can't get content from that website :confused:
<?php

require 'vendor/autoload.php';

use Readability\Readability;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

$logger = new Logger('log');
$logger->pushHandler(new StreamHandler(fopen('php://stderr', 'a+')));

$readability = new Readability($html, $url);
$readability->setLogger($logger);
$result = $readability->init();

if ($result) {
    var_export($readability->getContent()->ownerDocument->saveXML($readability->getContent()));
    die();
} else {
    echo "Looks like we couldn't find the content. :(\n";
}
Did you?
I use this code, the one in the repository.
<?php
require_once '../Readability.php';
header('Content-Type: text/html; charset=utf-8');

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}

$readability = new Readability($html, $url);
$readability->debug = true;
$readability->convertLinksToFootnotes = true;
$result = $readability->init();

if ($result) {
    echo "== Title =====================================\n";
    echo $readability->getTitle()->textContent, "\n\n";
    echo "== Body ======================================\n";
    $content = $readability->getContent()->textContent;
    echo($content);
} else {
    echo 'Looks like we couldn\'t find the content. :(';
}
?>
Thanks! This is what I have been asking for since my first question. And it looks like I still can't get the content: your script, like mine, displays "Looks like we couldn't find the content. :("
I will try to find something. The main idea is to keep the HTML of the content with its original tags, such as links, images, and possibly paragraphs, bold text, etc.
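To illustrate the goal of keeping the original tags, here is a hedged sketch that serializes a DOM node's children back to HTML instead of reading their text. `innerHtml()` is a hypothetical helper written for this example, not a php-readability function, and the sample markup is made up; only PHP's bundled DOM extension is assumed.

```php
<?php
/**
 * Hypothetical helper (not part of php-readability): serialize the children
 * of a DOM node back to HTML so that links, headings, etc. are preserved.
 */
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes that single node together with its tags.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Made-up sample content standing in for an extracted article node.
$sample = '<div><h3>Title</h3><p>Some <strong>bold</strong> text and '
    . '<a href="https://www.mywebsite.com/">a link</a>.</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($sample, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$article = $doc->getElementsByTagName('div')->item(0);

$kept = innerHtml($article);
echo $kept, "\n";
```

In the script posted above, the analogous change would be to serialize the node returned by `getContent()` (as the `saveXML()` call in the maintainer's snippet does) rather than echoing its `textContent`. The `convertLinksToFootnotes = true` setting is presumably also what moves links into footnotes, so turning it off may be worth trying; that is an assumption based on the property name, not something verified here.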
Hello there, is this issue still relevant with current master?