
Blog post Tag prediction/recommendation?

Open elfeffe opened this issue 3 years ago • 6 comments

I want to recommend tags for my blog posts. I'll have the text, and I need to receive tags back. Any recommendation on where to begin? Any example I can look at?

elfeffe avatar Dec 30 '20 22:12 elfeffe

I know you're not directly asking a blog or WordPress question but I can address at least the immediate issue. There are SEO plugins for WordPress that can help with keyword recommendation. This could be used for tags as well.

I'm also curious how something like this would be implemented. The WordPress side would be easy once the tags were available.

Something to keep in mind here is that this would also require some server knowledge to install and set up all the necessary dependencies.

carmelosantana avatar Dec 30 '20 22:12 carmelosantana

Hey @elfeffe, that's a great use case!

The way you'd approach this problem with machine learning would be to start labeling a portion of the blog posts in your database with tags by hand (either yourself or with someone else's help). Pair each sample (which may include things like the title and body of the post) with a single tag, duplicating samples for posts that have multiple tags. However many samples you self-annotate, this will be your dataset. I'd recommend setting aside about 20% of the data for cross-validation. The bigger your dataset, the better your results will be.
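To make the dataset preparation concrete, here's a language-agnostic sketch in Python (Rubix ML itself is a PHP library, so this is only an illustration of the idea, not the library's API). The post texts and tag names are made up:

```python
import random

# Hypothetical hand-labeled posts: each post has text and one or more tags.
posts = [
    {"text": "Install PHP 8 on Ubuntu", "tags": ["php", "linux"]},
    {"text": "Getting started with machine learning", "tags": ["ml"]},
    {"text": "Composer dependency tips", "tags": ["php"]},
    {"text": "Training a classifier in Rubix ML", "tags": ["php", "ml"]},
    {"text": "Ubuntu server hardening basics", "tags": ["linux"]},
]

# Duplicate each sample once per tag so every row has exactly one label.
samples = [(p["text"], tag) for p in posts for tag in p["tags"]]

# Shuffle, then hold out about 20% of the rows for cross-validation.
random.seed(42)
random.shuffle(samples)
split = int(len(samples) * 0.8)
train, validation = samples[:split], samples[split:]
```

The duplication step is what turns a multi-label problem into a single-label one that an ordinary classifier can learn.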

When inferring tags for new blog posts, you'll call the proba() method on a Probabilistic estimator to output the probabilities of every possible tag, sort the tags by their probability, and take the top k above a threshold as the inferred tags. The lower the threshold, the more tags you'll obtain, but they may be junk if it's set too low.
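The sort-and-threshold step can be sketched like this (again in Python for illustration; the probability values and tag names are hypothetical, standing in for what a probabilistic estimator would return):

```python
def infer_tags(probabilities, k=3, threshold=0.2):
    """Sort tags by probability, keep the top k that clear the threshold."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, p in ranked[:k] if p >= threshold]

# Hypothetical per-tag probabilities for one new post.
probs = {"php": 0.61, "ml": 0.27, "linux": 0.08, "wordpress": 0.04}

print(infer_tags(probs))                  # ['php', 'ml']
print(infer_tags(probs, threshold=0.05))  # ['php', 'ml', 'linux']
```

Lowering the threshold from 0.2 to 0.05 lets the low-confidence "linux" tag through, which is exactly the precision/recall trade-off described above.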

The Sentiment example is a good reference for natural language problems such as this one. The task is different, but many of the preprocessing steps are the same.

andrewdalpino avatar Dec 31 '20 00:12 andrewdalpino

Great, thank you, I will check that. This is not for a WP blog; the main idea is to learn how to use this library. We have 20,000 posts with multiple tags (added by hand). I will check how to begin. Thank you guys. Happy new year.

elfeffe avatar Jan 01 '21 14:01 elfeffe

@andrewdalpino would it be useful to remove common words (for, from, at, the) and remove accents (from Spanish words)? Or is it useless?

elfeffe avatar Jan 06 '21 14:01 elfeffe

Ok. That's the TF-IDF.

elfeffe avatar Jan 06 '21 22:01 elfeffe

@andrewdalpino would it be useful to remove common words (for, from, at, the) and remove accents (from Spanish words)? Or is it useless?

That's a good question, and I'm not sure there's a good answer except to run some experiments and see what works best for your data. To remove common words (a.k.a. stop words) you can try a couple of different strategies: use the max document frequency parameter on Word Count Vectorizer to bar stop words from entering the vocabulary, or filter stop words from the dataset before tokenizing the blobs using Stop Word Filter.

https://docs.rubixml.com/en/latest/transformers/word-count-vectorizer.html

https://docs.rubixml.com/en/latest/transformers/stop-word-filter.html
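Both preprocessing ideas from the question (stop-word removal and accent stripping) are simple to sketch. This is a language-agnostic Python illustration, not the Rubix ML API; the stop-word list here is a tiny made-up sample:

```python
import unicodedata

# A hypothetical stop-word list; a real one would be much larger.
STOP_WORDS = {"for", "from", "at", "the", "a", "an", "and", "or"}

def filter_stop_words(text):
    """Drop stop words before the text reaches the vectorizer."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def strip_accents(text):
    """Decompose characters and drop combining marks (é -> e, ñ -> n)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(filter_stop_words("Tips for deploying the app from a laptop"))
# tips deploying app laptop

print(strip_accents("programación y diseño"))
# programacion y diseno
```

Accent stripping mainly helps when the same word appears inconsistently accented across posts; whether it improves results for Spanish text is exactly the kind of thing worth A/B testing on the held-out 20%.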

Ok. That's the TF-IDF.

Not quite, but your intuition is good. Term Frequency - Inverse Document Frequency (TF-IDF) is a weighting scheme applied to the raw term counts produced by Word Count Vectorizer, such that common words are given less weight than rarer words. This is slightly different from removing the token from the bag of words entirely, in that some weight is still given to the occurrences.
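The down-weighting (rather than removal) can be shown with a toy computation. This is a generic TF-IDF sketch in Python with made-up term counts, using one common smoothed IDF variant; it is not the exact formula any particular vectorizer uses:

```python
import math

# Hypothetical raw term counts for three tiny documents.
docs = [
    {"php": 3, "the": 5},
    {"linux": 2, "the": 4},
    {"php": 1, "ml": 2, "the": 6},
]

n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: rarer terms score higher."""
    df = sum(1 for d in docs if term in d)
    return 1 + math.log((1 + n_docs) / (1 + df))

def tf_idf(doc):
    """Weight each raw count by the IDF of its term."""
    return {term: count * idf(term) for term, count in doc.items()}

weights = tf_idf(docs[0])
# "the" appears in every document, so its per-occurrence weight bottoms
# out at 1.0, while the rarer "php" gets a higher per-occurrence weight.
```

Note that "the" still ends up with a nonzero weight in the transformed vector; it is discounted per occurrence, not deleted the way a stop-word filter would delete it.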

andrewdalpino avatar Jan 10 '21 01:01 andrewdalpino