natural
Classification weight
We are working on a project where we need to weight documents differently as they are added to a classifier. For example, if I am adding a title and a description, I want the title to carry more weight. Any thoughts? Thank you.
Same issue here, so +1
Hi, I'm trying to understand the question you have posted here. Are you talking about a document of the form:
doc = {
title: 'Some Title',
description: 'We are working on a project where we need to weigh different documents being added to a classifier. For example if I have a title and description I am adding I want the title to carry more weight.'
};
If not, how would the title be distinguished from the description of the document?
You could repeat the title 10 times (or however many times you want, to increase the weight of the title). The classifier just counts things, so increase the count of the things you want to give more weight.
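A minimal sketch of this idea, assuming a classifier object that exposes `addDocument(tokens, label)` as used elsewhere in this thread. The helper name `addWeightedDocument`, the `titleWeight` parameter, and the whitespace tokenizer are hypothetical and not part of natural. The sketch adds the title as its own document several times; whether that shifts the result depends on how the classifier counts training examples.

```javascript
// Hypothetical helper: not part of natural.
// `classifier` is assumed to be any object with addDocument(tokens, label).
function addWeightedDocument(classifier, doc, label, titleWeight = 3) {
  // Naive whitespace tokenizer, used here only for illustration.
  const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

  // Add the description once.
  classifier.addDocument(tokenize(doc.description), label);

  // Add the title `titleWeight` times, each time as a separate document.
  for (let i = 0; i < titleWeight; i++) {
    classifier.addDocument(tokenize(doc.title), label);
  }
}

// Stand-in classifier that only records calls, so the sketch runs without natural.
const recorded = [];
const stubClassifier = {
  addDocument(tokens, label) {
    recorded.push({ tokens, label });
  },
};

addWeightedDocument(
  stubClassifier,
  { title: 'Some Title', description: 'a longer description' },
  'example',
  3
);
console.log(recorded.length); // 4: one description plus three title documents
```

With a real classifier, the stub would be replaced by the classifier instance, followed by a call to `train()` as in the other snippets in this thread.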
I've found the current weighting unreliable at best... Somehow it's hard to believe that:
classifier.addDocument(['callback', 'hell', 'npm', 'thenable', 'promise'], 'node')
classifier.addDocument(['collections', 'database', 'db', 'mongo', 'mongodb', 'MongoDb', 'ObjectId'], 'database')
classifier.train()
console.log(classifier.classify('What a bunch of users collections'))
Will get classified as node... que? This seems very unlikely to be correct, but I keep getting these awkward results back time and time again...
Even trimming it down:
classifier.addDocument(['callback', 'hell'], 'node')
classifier.addDocument(['collections', 'db'], 'database')
classifier.train()
console.log(classifier.classify('What a bunch of users collections'))
Happily returns node. I must have broken something somewhere? For one, weighting would allow me to inspect the relative importance of a term and tweak it so that it provides (better|correct) results.
I have exactly the same problem. I have noticed that with getClassifications, all results come back with an identical "value".
In your example, I got the same behaviour (using LogisticRegressionClassifier; the same happens with BayesClassifier, where the value is 0.6666666666666666):
[ { label: 'node', value: 0.5 },
{ label: 'database', value: 0.5 } ]
The algorithm starts to work only when the text contains at least "two patterns", like here:
classifier.getClassifications('What a bunch of users collections db, hell!')
The result is:
[ { label: 'database', value: 0.7072773051736339 },
{ label: 'node', value: 0.29272269482636626 } ]
So to "prevent" this case, ~~I check whether the value is 0.5 for all results, in which case I try another way... but yeah, it's an ugly hack... it would be great if the devs figured out this case.~~ read the next comment!
Wow, @KyleAMathews's solution kinda works.
classifier.addDocument(['callback', 'hell', 'npm', 'thenable', 'promise'], 'node')
classifier.addDocument(['collections', 'database', 'db', 'mongo', 'mongodb', 'MongoDb', 'ObjectId'], 'database')
classifier.addDocument(['collections'], 'database')
classifier.train();
console.log('result:', classifier.getClassifications('What a bunch of users collections'))
Result:
[ { label: 'database', value: 0.75 },
{ label: 'node', value: 0.5 } ]
Note: the weight is taken into account only when you re-add the pattern as a separate document. So the following code won't work:
classifier.addDocument(['collections', 'collections', ...], 'database')
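One possible explanation for this, stated here as an assumption rather than a confirmed detail of natural's internals: if the classifier reduces each document to binary presence features (a token is either in the document or it is not), duplicates inside a single array collapse to the same feature vector. The function `toPresenceVector` and the vocabulary below are hypothetical and for illustration only.

```javascript
// Hypothetical illustration: not natural's actual code.
// Maps a token list onto a fixed vocabulary as 1 (present) or 0 (absent).
function toPresenceVector(vocabulary, tokens) {
  return vocabulary.map((term) => (tokens.includes(term) ? 1 : 0));
}

const vocabulary = ['callback', 'hell', 'collections', 'db'];

const once = toPresenceVector(vocabulary, ['collections']);
const repeated = toPresenceVector(vocabulary, ['collections', 'collections', 'collections']);

console.log(once);     // [ 0, 0, 1, 0 ]
console.log(repeated); // [ 0, 0, 1, 0 ]
```

Under that assumption, repeating a token inside one array changes nothing, whereas a second `addDocument` call contributes another training example for the label.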