php-readability

Keep URLs in documents

Open boulama opened this issue 8 years ago • 9 comments

Hi, I have tried my best to adapt the given code so that URLs stay in the document, instead of being moved into footnotes or removed. Can anybody point me in the right direction?

Thank you :)

boulama avatar Sep 11 '17 03:09 boulama

Could you share a piece of code and what you want it to achieve?

j0k3r avatar Sep 11 '17 07:09 j0k3r

Well, it is Readability.php. I think the function is clean(). Say I have this content in an HTML page: <p>This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" /></p>.

I want the function to return This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" />.

Instead, now it returns this: This is my HTML code and this is a link and this is an image.

boulama avatar Sep 11 '17 10:09 boulama

URLs and images shouldn't be removed. Could you share your real example so I can try to reproduce what you describe?

j0k3r avatar Sep 11 '17 11:09 j0k3r

For example, this URL: http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php contains some h3 and some a elements. The main goal is to keep them (as well as the formatting) and return an HTML document, not a plain text file.

boulama avatar Sep 11 '17 11:09 boulama

Using that example, I can't get content from that website :confused:

<?php

require 'vendor/autoload.php';

use Readability\Readability;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

$logger = new Logger('log');
$logger->pushHandler(new StreamHandler(fopen('php://stderr', 'a+')));

$readability = new Readability($html, $url);
$readability->setLogger($logger);
$result = $readability->init();

if ($result) {
    var_export($readability->getContent()->ownerDocument->saveXML($readability->getContent()));
    die();
} else {
    echo "Looks like we couldn't find the content. :(\n";
}

Did you?

j0k3r avatar Sep 11 '17 12:09 j0k3r

I use this code, the one in the repository.

<?php
require_once '../Readability.php';
header('Content-Type: text/html; charset=utf-8');

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

if (function_exists('tidy_parse_string')) {
	$tidy = tidy_parse_string($html, array(), 'UTF8');
	$tidy->cleanRepair();
	$html = $tidy->value;
}


$readability = new Readability($html, $url);

$readability->debug = true;

$readability->convertLinksToFootnotes = true;

$result = $readability->init();

if ($result) {
	echo "== Title =====================================\n";
	echo $readability->getTitle()->textContent, "\n\n";
	echo "== Body ======================================\n";
	$content = $readability->getContent()->textContent;
	

	echo($content);
} else {
	echo 'Looks like we couldn\'t find the content. :(';
}
?>

boulama avatar Sep 11 '17 12:09 boulama

Thanks! This is what I have been asking for since my first question. It looks like I still can't get the content: your script, like mine, displays "Looks like we couldn't find the content. :("

j0k3r avatar Sep 11 '17 12:09 j0k3r

I will try to find something. The main idea is to keep the HTML of the content, with the original HTML tags such as links, images, and possibly paragraphs, bold text, etc.

boulama avatar Sep 12 '17 02:09 boulama
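
The goal described above (keeping links, images and other tags in the output) comes down to serializing the extracted node as markup rather than reading its text. Below is a minimal sketch of that difference. It uses only PHP's built-in DOM extension, not php-readability itself. The sample markup is the one from the earlier comment, and the variable names are illustrative assumptions:

```php
<?php
// Sample markup taken from the example earlier in this thread.
$html = '<p>This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a>'
      . ' and this is an image <img src="image.png" /></p>';

$doc = new DOMDocument();
// These flags stop libxml from adding <html>/<body> wrappers and a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$node = $doc->documentElement; // the <p> element

// textContent concatenates the text nodes only, so <a> and <img> are lost.
$asText = $node->textContent;

// saveHTML($node) serializes the node together with its child elements.
$asHtml = $doc->saveHTML($node);

echo $asText, "\n";
echo $asHtml, "\n";
```

The two scripts posted in this thread show the same contrast: the first serializes the result with `ownerDocument->saveXML($readability->getContent())`, while the second reads `getContent()->textContent` and also sets `convertLinksToFootnotes` to true.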

Hello there, is this issue still relevant with current master?

Kdecherf avatar Feb 15 '22 23:02 Kdecherf