php-readability Keep URLS into documents

Hi, I have tried my best to adapt the given code so that URLs are kept in the document, rather than being moved into footnotes or removed. Can anybody point me in the right direction?

Thank you :)

Sep 11 '17 03:09 boulama

Could you share a piece of code and what you want it to achieve?

Sep 11 '17 07:09 j0k3r

Well, it is Readability.php, and I think the relevant function is clean(). Say I have this content in an HTML page: <p>This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" /></p>.

I want the function to return This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" />.

Instead, now it returns this: This is my HTML code and this is a link and this is an image.

Sep 11 '17 10:09 boulama

URLs and images shouldn't be removed. Could you share your real example so I can try to reproduce what you describe?

Sep 11 '17 11:09 j0k3r

For example, this URL: http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php contains some h3 and a elements. The main goal is to keep them (as well as the formatting) and return an HTML document, not plain text.

Sep 11 '17 11:09 boulama

Using that example, I can't get content from that website :confused:

<?php

require 'vendor/autoload.php';

use Readability\Readability;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

$logger = new Logger('log');
$logger->pushHandler(new StreamHandler(fopen('php://stderr', 'a+')));

$readability = new Readability($html, $url);
$readability->setLogger($logger);
$result = $readability->init();

if ($result) {
    var_export($readability->getContent()->ownerDocument->saveXML($readability->getContent()));
    die();
} else {
    echo "Looks like we couldn't find the content. :(\n";
}

Did you?

Sep 11 '17 12:09 j0k3r

I use this code, the one in the repository.

<?php
require_once '../Readability.php';
header('Content-Type: text/html; charset=utf-8');

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

if (function_exists('tidy_parse_string')) {
	$tidy = tidy_parse_string($html, array(), 'UTF8');
	$tidy->cleanRepair();
	$html = $tidy->value;
}


$readability = new Readability($html, $url);

$readability->debug = true;

$readability->convertLinksToFootnotes = true;

$result = $readability->init();

if ($result) {
	echo "== Title =====================================\n";
	echo $readability->getTitle()->textContent, "\n\n";
	echo "== Body ======================================\n";
	$content = $readability->getContent()->textContent;
	

	echo($content);
} else {
	echo 'Looks like we couldn\'t find the content. :(';
}
?>

Sep 11 '17 12:09 boulama

Thanks! This is what I have been asking for since my first question. It looks like I still can't get the content: your script, like mine, displays "Looks like we couldn't find the content. :("

Sep 11 '17 12:09 j0k3r

I will try to find something. The main idea is to keep the HTML of the content with its original tags, such as links, images, and possibly paragraphs, bold text, etc.
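
One thing worth checking in the script above: it echoes `textContent`, and that DOM property returns only the text of a node and its descendants, so links and images are dropped regardless of what Readability kept. It also sets `convertLinksToFootnotes = true`, which is the opposite of the stated goal. Below is a minimal sketch using plain `DOMDocument` only; the `<div>` is a hypothetical stand-in for the node that `getContent()` would return, not actual library output.

```php
<?php
// Hypothetical stand-in for the element returned by $readability->getContent().
$doc = new DOMDocument();
$doc->loadHTML(
    '<div><p>This is my HTML code and '
    . '<a href="https://www.mywebsite.com/">this is a link</a> '
    . 'and this is an image <img src="image.png" /></p></div>'
);
$content = $doc->getElementsByTagName('div')->item(0);

// textContent flattens the subtree to plain text: every tag is lost.
$asText = $content->textContent;

// Serializing the node itself keeps the <a>, <img> and <p> markup.
$asHtml = $doc->saveHTML($content);

echo $asText, "\n";
echo $asHtml, "\n";
```

Applied to the script above, this would mean serializing the content node (for instance with `$readability->getContent()->ownerDocument->saveXML($readability->getContent())`, as in the earlier example) instead of reading `textContent`, assuming `init()` succeeds for the page in question.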

Sep 12 '17 02:09 boulama

Hello there, is this issue still relevant with current master?

Feb 15 '22 23:02 Kdecherf
