php-goose
getCleanedArticleText returning NULL
I was testing this library and a lot of websites return NULL. Here is an example:
https://www.aljazeera.com/news/2019/09/military-base-italian-military-convoy-attacked-somalia-190930102422698.html
Same here
Same for me
Same. Here are a few examples where it detected only the title, but didn't detect the main article HTML or the cleaned article text.
- https://www.believeintherun.com/2019/06/06/nike-pegasus-36-performance-review/
- https://www.jackrabbit.com/info/blog/nike-pegasus-36-review
I'm having the same issue. Any idea of a workaround or fix?
I'm also having this issue. I think it's because setCleanedArticleText() in Article.php is never being properly initiated. https://github.com/scotteh/php-goose/blob/master/src/Article.php
I believe this also means getPopularWords() will be blank, since it uses $this->article()->getCleanedArticleText(). https://github.com/scotteh/php-goose/blob/master/src/Modules/Extractors/AdditionalDataExtractor.php
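Until the root cause is fixed, a caller-side check at least makes the failure explicit instead of letting a NULL flow into later code. A minimal sketch, assuming only that the article object exposes getCleanedArticleText() as discussed in this thread; the helper name cleanedTextOrNull and the stand-in objects are hypothetical and not part of php-goose:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of php-goose): returns the cleaned text,
 * or null when extraction produced nothing usable.
 */
function cleanedTextOrNull(object $article): ?string
{
    $text = $article->getCleanedArticleText();

    // Treat NULL, non-strings and whitespace-only strings as "no body detected".
    if (!is_string($text) || trim($text) === '') {
        return null;
    }

    return $text;
}

// Stand-in objects so the sketch runs without the library installed.
$failed = new class {
    public function getCleanedArticleText() { return null; }
};
$ok = new class {
    public function getCleanedArticleText() { return "Body text"; }
};

var_dump(cleanedTextOrNull($failed) === null); // bool(true)
var_dump(cleanedTextOrNull($ok));              // string(9) "Body text"
```

With the real library, the object returned by $goose->extractContent($url) would be passed in place of the stand-ins.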
I made some edits to a couple of files: php-goose/src/Modules/Extractors/AdditionalDataExtractor.php and php-goose/src/Modules/Cleaners/DocumentCleaner.php.
Inside the DocumentCleaner class (DocumentCleaner.php) I added:
public function getDocument(){
    return $this->document();
}
In AdditionalDataExtractor.php, I made all the private functions public: getTags(), getVideos(), getLinks(), and getPopularWords(). Calling getLinks(), though, gave me a fatal error:
Fatal error: Uncaught Error: Call to a member function parent() on null in /home/[...]/public_html/composer/vendor/scotteh/php-goose/src/Modules/Extractors/AdditionalDataExtractor.php:129
Stack trace:
#0 /home/[...]/public_html/scripts/test2.php(21): Goose\Modules\Extractors\AdditionalDataExtractor->getLinks()
#1 {main}
  thrown in /home/[...]/public_html/composer/vendor/scotteh/php-goose/src/Modules/Extractors/AdditionalDataExtractor.php on line 129
I think it has something to do with getTopNode() in ContentExtractor.php, but I wasn't able to figure it out. https://github.com/scotteh/php-goose/blob/master/src/Modules/Extractors/ContentExtractor.php
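The error message says parent() was called on null, which is what would happen if the article has no top node when getLinks() runs. A minimal sketch of guarding that call, assuming getLinks() has been made public as described above and that the article exposes a getTopNode() accessor; the helper name linksOrEmpty and the stand-in objects are hypothetical and not part of php-goose:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical guard (not part of php-goose): only asks the extractor for
 * links when the article actually has a top node.
 */
function linksOrEmpty(object $article, object $extractor): array
{
    // Assumption: a null top node is what triggers
    // "Call to a member function parent() on null".
    if ($article->getTopNode() === null) {
        return [];
    }

    return $extractor->getLinks();
}

// Stand-ins so the sketch runs without the library installed.
$articleWithoutTopNode = new class {
    public function getTopNode() { return null; }
};
$extractor = new class {
    public function getLinks(): array
    {
        // Simulates the failure: the guard above should keep this from
        // being reached when the top node is null.
        throw new Error('getLinks() reached with a null top node');
    }
};

var_dump(linksOrEmpty($articleWithoutTopNode, $extractor)); // empty array
```

In a test script like the one in this thread, $article->setLinks(linksOrEmpty($article, $extractor)); could then stand in for the disabled setLinks() call. This only avoids the crash; it does not explain why the top node is missing.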
I made a test file with the code below. The getLinks() call is commented out, and filling tags, videos, and allImages doesn't seem to work. However, getting the article content, popular words (entities), and image does work.
If anyone has any further insights please share them. Let me know if you find this useful.
<?php
require_once("/path/to/public_html/composer/vendor/autoload.php");

use Goose\Client as GooseClient;
use Goose\Configuration;
use Goose\Modules\Cleaners\DocumentCleaner;
use Goose\Modules\Extractors\AdditionalDataExtractor;

// Run the normal extraction first.
$goose = new GooseClient();
$article = $goose->extractContent('https://www.wordpress.org');

// Re-run the cleaner by hand and read the text from the cleaned document.
// getDocument() is the accessor added to DocumentCleaner above.
$config = new Configuration();
$dc = new DocumentCleaner($config);
$dc->run($article);
$cleanedText = $dc->getDocument()->text();

// Re-run the additional data extractor; its getters were made public above.
$extractor = new AdditionalDataExtractor($config);
$extractor->run($article);

// Set the cleaned text before asking for popular words, which depend on it.
$article->setCleanedArticleText($cleanedText);
$article->setTags($extractor->getTags());
// Disabled: getLinks() throws the fatal error shown above.
//$article->setLinks($extractor->getLinks());
$article->setVideos($extractor->getVideos());
$article->setPopularWords($extractor->getPopularWords());

// Collect the results for inspection.
$goose_array = array();
$goose_array['title'] = $article->getTitle();
$goose_array['metaDescription'] = $article->getMetaDescription();
$goose_array['metaKeywords'] = $article->getMetaKeywords();
$goose_array['canonicalLink'] = $article->getCanonicalLink();
$goose_array['domain'] = $article->getDomain();
$goose_array['tags'] = $article->getTags();
// setLinks() is not called above, so this is whatever the article already held.
$goose_array['links'] = $article->getLinks();
$goose_array['videos'] = $article->getVideos();
$goose_array['articleText'] = $article->getCleanedArticleText();
$goose_array['entities'] = $article->getPopularWords();
$goose_array['image'] = $article->getTopImage();
$goose_array['allImages'] = $article->getAllImages();

print_r($goose_array);
?>