php-text-analysis
Notice & Warning on lines 216, 217, 219 WordnetCorpus.php
I am trying out your awesome library and I found notices & warnings on lines 216, 217, 219 of php-text-analysis/src/corpus/WordnetCorpus.php
It happens when you call stem() with the MorphStemmer class, which uses the WordNet corpus:
$stemmedTokens = stem($top_keywords, \TextAnalysis\Stemmers\MorphStemmer::class);
Thanks for reporting the issue. I will check it out later this week.
Cheers,
(In reply to the report by Muhammad Mehroz Anjum, Tue, Oct 5, 2021, 5:17 AM.)
@mehroz1, can you please provide a test case so I can recreate the issue?
Thanks,
@yooper Here, run this test file. It also uses PHP-ML; you can remove those lines and provide the tokens directly on line 60:
<?php
ini_set("memory_limit", "-1");
set_time_limit(0);
require_once __DIR__ . '/vendor/autoload.php';
use Phpml\Tokenization\WordTokenizer;
use Phpml\FeatureExtraction\StopWords\English;
use Phpml\FeatureExtraction\TfIdfTransformer;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Preprocessing\Normalizer;
use TextAnalysis\Tokenizers\GeneralTokenizer;
function getWikipediaPage($page, $save_page=false) {
global $data_set_dir;
ini_set('user_agent', 'NlpToolsTest/1.0 ([email protected])');
// Fetch the page once and reuse the response for both saving and parsing
$response = file_get_contents("http://en.wikipedia.org/w/api.php?format=json&action=parse&page=".urlencode($page));
if($save_page){
file_put_contents($data_set_dir."/".$page."/".$page.".txt", $response);
}
$page = json_decode($response,true);
return preg_replace('/\s+/',' ',strip_tags($page['parse']['text']['*']));
}
function getDataFromFile($file_name = "./sample-data.txt", $processed = true){
if($processed==true){
$page = json_decode(file_get_contents("$file_name"),true);
return preg_replace('/\s+/',' ',strip_tags($page['parse']['text']['*']));
}else{
return file_get_contents($file_name);
}
}
global $page_name, $data_set_dir;
$page_name = "Aristotle";
$data_set_dir = "./data-sets"; # without trailing slash
if(!is_dir($data_set_dir)){
mkdir($data_set_dir);
}
if(!is_dir($data_set_dir."/".$page_name)){
mkdir($data_set_dir."/".$page_name);
}
$sample_text = $sample_text_ori= getWikipediaPage($page_name, true);
# $sample_text = $sample_text_ori = getDataFromFile($data_set_dir."/".$page_name."/".$page_name.".txt",true);
//print("<pre>".print_r($sample_text,true)."</pre>");
$tokenizer = new WordTokenizer();
$tokenized_sample_text = $tokenizer->tokenize($sample_text);
$vectorizer = new TokenCountVectorizer(new WordTokenizer, new English());
$vectorizer->fit($tokenized_sample_text);
$vectorized_text = $vectorizer->getVocabulary();
#print("<pre>".print_r($vectorized_text,true)."</pre>");
#exit();
# $tokens = tokenize($sample_text_ori); # Text Analysis tokenization
$normalizedTokens = normalize_tokens($vectorized_text);
# print("<pre>".print_r($normalizedTokens,true)."</pre>");
$stopWords = [
'a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', 'aren\'t', 'as', 'at', 'be', 'because',
'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can\'t', 'cannot', 'could', 'couldn\'t', 'did', 'didn\'t',
'do', 'does', 'doesn\'t', 'doing', 'don\'t', 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn\'t', 'has',
'hasn\'t', 'have', 'haven\'t', 'having', 'he', 'he\'d', 'he\'ll', 'he\'s', 'her', 'here', 'here\'s', 'hers', 'herself', 'him',
'himself', 'his', 'how', 'how\'s', 'i', 'i\'d', 'i\'ll', 'i\'m', 'i\'ve', 'if', 'in', 'into', 'is', 'isn\'t', 'it', 'it\'s', 'its',
'itself', 'let\'s', 'me', 'more', 'most', 'mustn\'t', 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or',
'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', 'shan\'t', 'she', 'she\'d', 'she\'ll', 'she\'s', 'should',
'shouldn\'t', 'so', 'some', 'such', 'than', 'that', 'that\'s', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there',
'there\'s', 'these', 'they', 'they\'d', 'they\'ll', 'they\'re', 'they\'ve', 'this', 'those', 'through', 'to', 'too', 'under',
'until', 'up', 'very', 'was', 'wasn\'t', 'we', 'we\'d', 'we\'ll', 'we\'re', 'we\'ve', 'were', 'weren\'t', 'what', 'what\'s',
'when', 'when\'s', 'where', 'where\'s', 'which', 'while', 'who', 'who\'s', 'whom', 'why', 'why\'s', 'with', 'won\'t', 'would',
'wouldn\'t', 'you', 'you\'d', 'you\'ll', 'you\'re', 'you\'ve', 'your', 'yours', 'yourself', 'yourselves', 'a', 'abbr', 'b', 'bdi', 'br', 'col', 'dd', 'del', 'dfn', 'div', 'dl', 'dt', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 'ins', 'kbd', 'li', 'ol', 'p', 'q', 'rb', 'rp', 'rt', 'rtc', 's', 'sup', 'td', 'th', 'tr', 'u', 'ul', 'li', 'var', 'wbr', 'px', 'st', 'a', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent', 'arise', 'around', 'as', 'aside', 'ask', 'asking', 'at', 'auth', 'available', 'away', 'awfully', 'b', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'between', 'beyond', 'biol', 'both', 'brief', 'briefly', 'but', 'by', 'c', 'ca', 'came', 'can', 'cannot', 'can\'t', 'cause', 'causes', 'certain', 'certainly', 'co', 'com', 'come', 'comes', 'contain', 'containing', 'contains', 'could','couldnt','couldn\'t', 'd', 'date', 'did', 'didn\'t', 'different', 'do', 'does', 'doesn\'t', 'doing', 'done', 'don\'t', 'down', 'downwards', 'due', 'during', 'e', 'each', 'ed', 'edu', 'effect', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'especially', 'et', 'et-al', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'except', 'f', 'far', 'few', 'ff', 'fifth', 'first', 'five', 'fix', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'found', 'four', 'from', 'further', 
'furthermore', 'g', 'gave', 'get', 'gets', 'getting', 'give', 'given', 'gives', 'giving', 'go', 'goes', 'gone', 'got', 'gotten', 'h', 'had', 'happens', 'hardly', 'has', 'hasn\'t', 'have', 'haven\'t', 'having', 'he', 'hed', 'hence', 'her', 'here','hereafter', 'hereby', 'herein', 'heres', 'hereupon', 'hers', 'herself', 'hes', 'hi', 'hid', 'him', 'himself', 'his', 'hither', 'home', 'how', 'howbeit', 'however', 'hundred', 'i', 'id', 'ie', 'if', 'i\'ll', 'im', 'immediate', 'immediately', 'importance', 'important', 'in', 'inc', 'indeed', 'index', 'information', 'instead', 'into', 'invention', 'inward', 'is', 'isn\'t', 'it', 'itd', 'it\'ll', 'its', 'itself', 'i\'ve' , 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'kg', 'km', 'know', 'known', 'knows', 'l', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', 'lets', 'like', 'liked', 'likely', 'line', 'little', '\'ll', 'look', 'looking', 'looks', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'million', 'miss', 'ml', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'mug', 'must', 'my', 'myself', 'n', 'na', 'name', 'namely', 'nay', 'nd', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'now', 'nowhere', 'o', 'obtain', 'obtained', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'omitted', 'on','once', 'one', 'ones', 'only', 'onto', 'or', 'ord', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'owing', 'own', 'p', 'page', 'pages', 'part', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 'predominantly', 'present', 'previously', 'primarily', 
'probably', 'promptly', 'proud', 'provides', 'put', 'q', 'que', 'quickly', 'quite', 'qv', 'r', 'ran', 'rather', 'rd', 're', 'readily', 'really', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'respectively', 'resulted', 'resulting', 'results', 'right', 'run', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'sec', 'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sent', 'seven', 'several', 'shall', 'she', 'shed', 'she\'ll', 'shes', 'should', 'shouldn\'t', 'show', 'showed', 'shown', 'showns', 'shows', 'significant', 'significantly', 'similar', 'similarly', 'since', 'six', 'slightly', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specifically', 'specified', 'specify', 'specifying', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'that\'ll', 'thats', 'that\'ve', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'thered', 'therefore', 'therein', 'there\'ll', 'thereof', 'therere', 'theres', 'thereto', 'thereupon', 'there\'ve', 'these', 'they', 'theyd', 'they\'ll', 'theyre', 'they\'ve', 'think', 'this', 'those', 'thou', 'though', 'thoughh', 'thousand', 'throug', 'through', 'throughout', 'thru', 'thus', 'til', 'tip', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 'twice', 'two', 'u', 'un', 'under','unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'ups', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'v', 'value', 'various', '\'ve', 'very', 'via', 'viz', 'vol', 'vols', 'vs', 'w', 'want', 'wants', 'was', 'wasnt', 'wasn\'t', 'way', 
'we', 'wed', 'welcome', 'we\'ll', 'went', 'were', 'werent', 'weren\'t', 'we\'ve', 'what', 'whatever', 'what\'ll', 'whats', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', 'whod', 'whoever', 'whole', 'who\'ll', 'whom', 'whomever', 'whos', 'whose', 'why', 'widely', 'willing', 'wish', 'with', 'within', 'without', 'wont', 'words', 'world', 'would', 'wouldnt','wouldn\'t', 'www', 'x', 'y', 'yes', 'yet', 'you', 'youd', 'you\'ll', 'your', 'youre', 'yours', 'yourself', 'yourselves', 'you\'ve', 'z', 'zero'
];
$filters = array(
new \TextAnalysis\Filters\LowerCaseFilter(),
new \TextAnalysis\Filters\QuotesFilter(),
new \TextAnalysis\Filters\StripTagsFilter(),
new \TextAnalysis\Filters\TrimFilter(),
new \TextAnalysis\Filters\PunctuationFilter(),
new \TextAnalysis\Filters\QuotesFilter(),
new \TextAnalysis\Filters\SpacePunctuationFilter(),
new \TextAnalysis\Filters\WhitespaceFilter(),
new \TextAnalysis\Filters\NumbersFilter(),
new \TextAnalysis\Filters\DomainFilter(),
new \TextAnalysis\Filters\EmailFilter(),
new \TextAnalysis\Filters\CharFilter(),
new \TextAnalysis\Filters\StopWordsFilter($stopWords)
);
$document = new \TextAnalysis\Documents\TokensDocument($normalizedTokens);
$docCollection = new \TextAnalysis\Collections\DocumentArrayCollection(array($document));
$docCollection->applyTransformations($filters);
//print("<pre>".print_r($docCollection[0]->getDocumentData(),true)."</pre>");
$freqDist = new \TextAnalysis\Analysis\FreqDist($docCollection[0]->getDocumentData());
$frequency_keywords = $freqDist->getKeyValuesByFrequency();
file_put_contents($data_set_dir."/".$page_name."/".$page_name."-frequency-keywords.txt", json_encode($frequency_keywords));
//$top1000 = array_splice($frequency_keywords, 0, 1000);
# print("<pre>".print_r($top10,true)."</pre>");
$top_keywords = []; // initialise so an empty frequency list does not leave the variable undefined
foreach($frequency_keywords as $key => $single_keyword){
$top_keywords[] = (string)$key;
}
file_put_contents($data_set_dir."/".$page_name."/".$page_name."-keywords.txt", json_encode($top_keywords));
//print("<pre>".print_r($top_keywords,true)."</pre>");
$stemmedTokens = stem($top_keywords, \TextAnalysis\Stemmers\MorphStemmer::class);
file_put_contents($data_set_dir."/".$page_name."/".$page_name."-stemmed-tokens.txt", json_encode($stemmedTokens));
print("<pre>".print_r(array_filter( $stemmedTokens),true)."</pre>");
I am testing this library on PHP 8.0.11, and I solved this issue by adding an (int) cast on lines 216, 217, and 219 of WordnetCorpus.php.
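For anyone hitting the same thing, here is a minimal sketch of why an (int) cast can silence this kind of notice or warning. The thread does not show lines 216, 217, and 219 of WordnetCorpus.php, so this is only an assumption about the cause: arithmetic on a string that is not cleanly numeric. The helper `parseIntFields()` and the sample values are hypothetical and are not part of php-text-analysis.

```php
<?php
/**
 * Hypothetical helper (not library code): read the first $count
 * whitespace-separated fields of a record as integers.
 */
function parseIntFields(string $line, int $count): array
{
    $fields = preg_split('/\s+/', trim($line));
    $ints = [];
    for ($i = 0; $i < $count; $i++) {
        // An explicit (int) cast never raises a notice or warning:
        // a missing field becomes 0 and "12abc" becomes 12.
        $ints[] = (int) ($fields[$i] ?? 0);
    }
    return $ints;
}

// Convert notices and warnings into exceptions so they are easy to detect.
set_error_handler(function (int $severity, string $message, string $file, int $line): bool {
    throw new ErrorException($message, 0, $severity, $file, $line);
});

$raw = '12abc'; // assumed example of a value that is only partly numeric

try {
    $sum = $raw + 1;   // implicit conversion: raises a notice (PHP 7) or warning (PHP 8)
    $warned = false;
} catch (ErrorException $e) {
    $warned = true;
}

$casted = (int) $raw + 1; // explicit cast: no diagnostic is raised

restore_error_handler();
```

If the real cause on those lines is something else (for example an undefined array offset), the cast would only hide the symptom, so it is worth checking which message PHP actually prints before applying it.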