php-goose

getCleanedArticleText returning NULL

Open kaymly opened this issue 5 years ago • 5 comments

I was testing this library and a lot of websites return NULL. Here is an example:

https://www.aljazeera.com/news/2019/09/military-base-italian-military-convoy-attacked-somalia-190930102422698.html

kaymly avatar Oct 01 '19 20:10 kaymly

Same here

seuaCoder avatar Oct 02 '19 08:10 seuaCoder

Same for me.

folkevil avatar Oct 24 '19 08:10 folkevil

Same. Here are a few examples where it detected only the title, but didn't detect the main article HTML or the cleaned article text.

  1. https://www.believeintherun.com/2019/06/06/nike-pegasus-36-performance-review/
  2. https://www.jackrabbit.com/info/blog/nike-pegasus-36-review

swash13 avatar Oct 30 '19 10:10 swash13

I'm having the same issue. Any idea of a workaround or fix?

KW4NP avatar Jan 01 '22 12:01 KW4NP

I'm also having this issue. I think it's because setCleanedArticleText() in Article.php is never actually being called, so the cleaned text is never set. https://github.com/scotteh/php-goose/blob/master/src/Article.php

I believe this also means getPopularWords() will be blank, since it uses $this->article()->getCleanedArticleText(). https://github.com/scotteh/php-goose/blob/master/src/Modules/Extractors/AdditionalDataExtractor.php

I made some edits to a couple of files: php-goose/src/Modules/Extractors/AdditionalDataExtractor.php and php-goose/src/Modules/Cleaners/DocumentCleaner.php.

Inside the DocumentCleaner class (DocumentCleaner.php) I added:

public function getDocument(){
	return $this->document();
}

In AdditionalDataExtractor.php, I made all the private functions public: getTags(), getVideos(), getLinks(), and getPopularWords(). With getLinks(), though, I got a fatal error:

Fatal error: Uncaught Error: Call to a member function parent() on null in /home/[...]/public_html/composer/vendor/scotteh/php-goose/src/Modules/Extractors/AdditionalDataExtractor.php:129
Stack trace:
#0 /home/[...]/public_html/scripts/test2.php(21): Goose\Modules\Extractors\AdditionalDataExtractor->getLinks()
#1 {main}
  thrown in /home/[...]/public_html/composer/vendor/scotteh/php-goose/src/Modules/Extractors/AdditionalDataExtractor.php on line 129

I think it has something to do with getTopNode() in ContentExtractor.php, but I wasn't able to figure it out. https://github.com/scotteh/php-goose/blob/master/src/Modules/Extractors/ContentExtractor.php
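The error message says parent() was called on null, which would fit getTopNode() returning null when no main content node is found. A minimal sketch of a null guard for that case follows, assuming PHP 8.0+ for the nullsafe operator. StubNode and safeLinks() are hypothetical stand-ins made up for illustration; they are not php-goose classes or methods, and this is not the library's actual getLinks() code.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a DOM element; NOT php-goose's real node class.
final class StubNode
{
    /** @param string[] $hrefs */
    public function __construct(
        private ?StubNode $parent = null,
        private array $hrefs = []
    ) {
    }

    public function parent(): ?StubNode
    {
        return $this->parent;
    }

    /** @return string[] */
    public function hrefs(): array
    {
        return $this->hrefs;
    }
}

/**
 * Collect links from the parent of the top node, or return an empty
 * list when there is no top node or no parent.
 *
 * @return string[]
 */
function safeLinks(?StubNode $topNode): array
{
    // The nullsafe operator short-circuits to null instead of raising
    // "Call to a member function parent() on null".
    return $topNode?->parent()?->hrefs() ?? [];
}

// No top node at all: an empty list instead of a fatal error.
var_dump(safeLinks(null));
```

On PHP 7.x the same guard could be written as an explicit `if ($topNode === null) { return []; }` check before calling parent().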

I made a test file with the code below. The getLinks() call is commented out, and populating tags, videos, and allImages doesn't seem to work. However, getting the article content, popular words (entities), and image does work.

If anyone has any further insights please share them. Let me know if you find this useful.

<?php

require_once("/path/to/public_html/composer/vendor/autoload.php");

use Goose\Client as GooseClient;
use Goose\Configuration;
use Goose\Modules\Cleaners\DocumentCleaner;
use Goose\Modules\Extractors\AdditionalDataExtractor;

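// Run the normal extraction first.
$goose = new GooseClient();
$article = $goose->extractContent('https://www.wordpress.org');

// Re-run the cleaner by hand and read its text through getDocument(),
// the accessor added to DocumentCleaner above.
$config = new Configuration();
$dc = new DocumentCleaner($config);
$dc->run($article);
$cleanedText = $dc->getDocument()->text();

// Re-run the extractor and copy its results onto the article. This relies on
// getTags(), getVideos() and getPopularWords() having been made public.
$extractor = new AdditionalDataExtractor($config);
$extractor->run($article);
$article->setCleanedArticleText($cleanedText);
$article->setTags($extractor->getTags());
// getLinks() is left out because it throws the fatal error shown above.
//$article->setLinks($extractor->getLinks());
$article->setVideos($extractor->getVideos());
$article->setPopularWords($extractor->getPopularWords());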

$goose_array = array();
$goose_array['title'] = $article->getTitle();
$goose_array['metaDescription'] = $article->getMetaDescription();
$goose_array['metaKeywords'] = $article->getMetaKeywords();
$goose_array['canonicalLink'] = $article->getCanonicalLink();
$goose_array['domain'] = $article->getDomain();
$goose_array['tags'] = $article->getTags();
$goose_array['links'] = $article->getLinks();
$goose_array['videos'] = $article->getVideos();
$goose_array['articleText'] = $article->getCleanedArticleText();
$goose_array['entities'] = $article->getPopularWords();
$goose_array['image'] = $article->getTopImage();
$goose_array['allImages'] = $article->getAllImages();

print_r($goose_array);
?>
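To see at a glance which fields came back empty for a given URL, a small helper along these lines might be useful. This is only a sketch: emptyFields() is a hypothetical name, not part of php-goose, and the sample array is made up for illustration rather than real output from the library.

```php
<?php

declare(strict_types=1);

/**
 * Return the keys whose values are null, an empty string, or an empty array.
 *
 * @param array<string, mixed> $result
 * @return array<int, int|string>
 */
function emptyFields(array $result): array
{
    $missing = [];
    foreach ($result as $key => $value) {
        if ($value === null || $value === '' || $value === []) {
            $missing[] = $key;
        }
    }
    return $missing;
}

// Made-up sample data shaped like the $goose_array above.
$sample = [
    'title'       => 'Example title',
    'articleText' => null,
    'tags'        => [],
    'entities'    => ['example' => 3],
];

print_r(emptyFields($sample));
```

In the test file above it could be called as emptyFields($goose_array) just before the print_r() call.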

solkad avatar Jun 04 '22 07:06 solkad