ML News article category prediction

Hi there 👋🏽

I am trying to build a multiclass classification model that predicts the category of a news article. Say I have the following categories:

Technology
World
Sports
Politics
Entertainment
Automobile
Science

For this, I have a small(ish) dataset of around 5000 news articles that are already labelled. The dataset looks like this:

dataset.csv:

headline     | article           | category
Lorem ipsum  | Lorem ipsum ..... | technology
Lorem ipsum  | Lorem ipsum ..... | politics
Lorem ipsum  | Lorem ipsum ..... | sports
Lorem ipsum  | Lorem ipsum ..... | entertainment
...

This is my progress so far:

use Rubix\ML\Extractors\ColumnPicker;
use Rubix\ML\Extractors\CSV;

$extractor = new ColumnPicker(new CSV('dataset.csv', true), [
    'article', 'category'
]);

$dataset = \Rubix\ML\Datasets\Labeled::fromIterator($extractor);
[$training, $testing] = $dataset->stratifiedSplit(0.8);

$estimator = new \Rubix\ML\Classifiers\NaiveBayes();

$estimator->train($training);

return $estimator->predictSample([
    'The New York Giants won their series with a thin margin. But they are now ready for their next challenge!'
]);

The above prints out:

world

In fact, no matter what text I try to predict on, the result is always world. If I use probaSample(), I see that the probability scores don't change even though the text sample does.

So it seems like I am doing something wrong. I was hoping I would be able to get some help getting further in my ML journey 😃

Feb 18 '21 18:02 oliverbj

I would say using a return statement while you are not in either a class or function stands out to me. Can you assign it and then create an object dump to see what's in the object?

Feb 18 '21 18:02 AtomLaw

If I change it to use probaSample(), assign the result to a variable, and dump it (with dd(), as I am using Laravel), I get the following:

array:7 [
  "technology" => 0.15586099585062
  "sports" => 0.17764522821577
  "world" => 0.211877593361
  "politics" => 0.11332987551867
  "entertainment" => 0.20720954356847
  "automobile" => 0.053163900414938
  "science" => 0.080912863070539
]

Feb 18 '21 18:02 oliverbj

I'm just guessing here... If "world" is one of your labels, which it sounds like it is, then your model has predicted that it's the most likely option. Have you hosted your project anywhere it can be seen, if you are at liberty to do so?

Feb 18 '21 18:02 AtomLaw

I just find it odd that no matter what input text I try to predict on, the probability scores are the same. I can even write "Lalalala" and I still get world with a probability score of 0.211877593361.

The code I posted is the entire code I have written so far. I am just on a fresh Laravel installation.

Feb 18 '21 18:02 oliverbj

Hi @oliverbj! Without any preprocessing of your features, each title and text blob will be treated as a single, unique categorical value. With nothing else to go by, your estimator is probably just predicting the class with the highest prior probability, which would also explain why the probabilities never change. Have you seen the section of the user guide regarding representing text features? After preprocessing, your text blobs will (usually) be represented as continuous features (not categorical).

https://docs.rubixml.com/latest/representing-your-data.html#text
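
To make the point about priors concrete, here is a minimal sketch in plain PHP. It is not Rubix ML's implementation: it is a toy categorical naive Bayes over a single feature, and it assumes that a value never seen during training contributes no evidence for any class. The function names and the four-row dataset are made up for illustration. Because every raw text blob is a value the model has never seen, the output collapses to the class frequencies of the training set.

```php
<?php

/**
 * Toy categorical naive Bayes over one feature (illustration only).
 * Counts how often each class occurs and which exact values it was seen with.
 */
function trainToyNaiveBayes(array $samples, array $labels): array
{
    $valueCounts = [];

    foreach ($samples as $i => $value) {
        $label = $labels[$i];
        $valueCounts[$label][$value] = ($valueCounts[$label][$value] ?? 0) + 1;
    }

    return [
        'classCounts' => array_count_values($labels),
        'valueCounts' => $valueCounts,
        'n' => count($labels),
    ];
}

/** Return the normalized class probabilities for a single feature value. */
function probaToyNaiveBayes(array $model, string $value): array
{
    // Has any class seen this exact value during training?
    $seen = false;
    foreach ($model['valueCounts'] as $counts) {
        if (isset($counts[$value])) {
            $seen = true;
            break;
        }
    }

    $scores = [];
    foreach ($model['classCounts'] as $label => $count) {
        $prior = $count / $model['n'];
        // Assumption: an unseen value is ignored, so only the prior remains.
        $likelihood = $seen ? ($model['valueCounts'][$label][$value] ?? 0) / $count : 1.0;
        $scores[$label] = $prior * $likelihood;
    }

    $total = array_sum($scores);

    return array_map(fn ($score) => $score / $total, $scores);
}

// Each whole article is one categorical value, just like an unprocessed text column.
$samples = ['Full text of article A', 'Full text of article B', 'Full text of article C', 'Full text of article D'];
$labels = ['world', 'world', 'sports', 'politics'];

$model = trainToyNaiveBayes($samples, $labels);

$first = probaToyNaiveBayes($model, 'The New York Giants won their series with a thin margin.');
$second = probaToyNaiveBayes($model, 'Lalalala');

// Both calls return the class priors: world 0.5, sports 0.25, politics 0.25.
print_r($first);
print_r($second);
```

Only a text that matches a training article character for character would produce anything other than the priors here, which is why the text has to be broken down into word-level features first.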

The Sentiment tutorial and example project are a great place to get an idea of how to preprocess text so it can be used with a Learner.

https://github.com/RubixML/Sentiment

Here's an example of how to preprocess the text columns of a dataset into weighted term frequency vectors.

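use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;

// Normalize the text, build word count vectors over a vocabulary
// capped at 10,000 terms, then apply TF-IDF weighting to the counts.
$dataset->apply(new TextNormalizer())
    ->apply(new WordCountVectorizer(10000))
    ->apply(new TfIdfTransformer());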
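
For a rough idea of what those three steps compute, here is a plain PHP sketch of normalizing, counting words, and TF-IDF weighting. It is not how Rubix ML implements its transformers: the function names are hypothetical, the tokenizer only handles ASCII letters, and the smoothed IDF formula is just one common variant. The part worth noticing is the split between fitting (learning the vocabulary and weights from the training texts) and transforming (mapping any text, including a new sample, onto that same vocabulary).

```php
<?php

/** Lowercase the text and split it into word tokens (ASCII letters only). */
function tokenize(string $text): array
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);

    return $matches[0];
}

/** Fit: learn the vocabulary and an IDF weight per term from the training texts. */
function fitTfIdf(array $texts, int $maxVocabulary): array
{
    $documentFrequency = [];

    foreach ($texts as $text) {
        // Count each term once per document.
        foreach (array_unique(tokenize($text)) as $token) {
            $documentFrequency[$token] = ($documentFrequency[$token] ?? 0) + 1;
        }
    }

    // Keep only the terms that appear in the most documents.
    arsort($documentFrequency);
    $documentFrequency = array_slice($documentFrequency, 0, $maxVocabulary, true);

    $n = count($texts);
    $idf = [];

    foreach ($documentFrequency as $token => $df) {
        // Smoothed inverse document frequency (one variant among several).
        $idf[$token] = log(($n + 1) / ($df + 1)) + 1.0;
    }

    return $idf;
}

/** Transform: turn a text into a vector of term count times IDF weight. */
function transformTfIdf(string $text, array $idf): array
{
    $counts = array_count_values(tokenize($text));
    $vector = [];

    foreach ($idf as $token => $weight) {
        $vector[] = ($counts[$token] ?? 0) * $weight;
    }

    return $vector;
}

$training = [
    'The team won the match',
    'The election results are in',
    'New phone released this week',
];

$idf = fitTfIdf($training, 10000);

// A new sample must go through the same fitted transformation as the training data.
$sample = transformTfIdf('The Giants won their series', $idf);

echo count($sample) . " features, " . count(array_filter($sample)) . " of them non-zero\n";
```

The same applies when predicting with Rubix ML: a raw string handed to a model trained on transformed data will not match its features. Rubix ML provides a Pipeline meta-estimator that wraps an estimator together with its transformers for this purpose.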

Since you have multiple text blob columns in your dataset (one for the title, one for the article), you'll most likely want to preprocess them individually for greater control over the representation of each column. For example, you might not need as large a vocabulary for the title column as for the article column. For that, you can check out the section on advanced preprocessing.

https://docs.rubixml.com/latest/preprocessing.html#advanced-preprocessing
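
As a library-independent illustration of preprocessing each column on its own, the sketch below builds a separate vocabulary per column, with a smaller cap for the headline than for the article, and concatenates the two count vectors into one feature vector. Everything in it is hypothetical (function names, caps, sample rows); the linked section describes how to do this with Rubix ML itself.

```php
<?php

/** Build a vocabulary of the $maxSize most frequent lowercase words. */
function buildVocabulary(array $texts, int $maxSize): array
{
    $counts = [];

    foreach ($texts as $text) {
        foreach (str_word_count(strtolower($text), 1) as $token) {
            $counts[$token] = ($counts[$token] ?? 0) + 1;
        }
    }

    arsort($counts);

    return array_keys(array_slice($counts, 0, $maxSize, true));
}

/** Count how often each vocabulary term occurs in the text. */
function countVector(string $text, array $vocabulary): array
{
    $counts = array_count_values(str_word_count(strtolower($text), 1));

    return array_map(fn ($term) => $counts[$term] ?? 0, $vocabulary);
}

$rows = [
    ['headline' => 'Giants win again', 'article' => 'The Giants won their series with a thin margin'],
    ['headline' => 'Election results announced', 'article' => 'Voters went to the polls and the results are in'],
];

// Headlines are short, so a small vocabulary is assumed to be enough for them.
$headlineVocabulary = buildVocabulary(array_column($rows, 'headline'), 3);
$articleVocabulary = buildVocabulary(array_column($rows, 'article'), 8);

// One feature vector per row: headline counts followed by article counts.
$features = array_map(
    fn ($row) => array_merge(
        countVector($row['headline'], $headlineVocabulary),
        countVector($row['article'], $articleVocabulary)
    ),
    $rows
);

echo count($features[0]) . " features per row\n";
```

Keeping the vocabularies separate means a word in the headline and the same word in the article body become two different features, so a model can weight them differently.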

You can also read up on the bag-of-words method for a more general idea of what's going on.

https://en.wikipedia.org/wiki/Bag-of-words_model

Feb 18 '21 21:02 andrewdalpino
