sentiment
Allow token processing "middleware"
Hi,
Would it be possible to add an option that first computes the string distance between each token and the words in the positive/negative list and then, if the similarity is above some threshold, treats the token as that word, so that spelling mistakes and casual writing styles are not lost?
e.g.
> sentiment('Cats are dumb');
{ score: -3,
comparative: -1,
tokens: [ 'cats', 'are', 'dumb' ],
words: [ 'dumb' ],
positive: [],
negative: [ 'dumb' ] }
> sentiment('Cats are dumbbb');
{ score: 0,
comparative: 0,
tokens: [ 'cats', 'are', 'dumbbb' ],
words: [],
positive: [],
negative: [] }
In this example, dumbbb is so close to dumb that it should be classified as such. Using a library like natural makes this easy:
require('natural').JaroWinklerDistance('dumb', 'dumbbb')
0.9333333333333333
If adding natural as a dependency is out of scope, maybe a hook that lets someone inject it as a processing step could work too.
What do you think? Would this work?
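To make the idea concrete, here is a rough, dependency-free sketch of threshold matching. It is only an illustration: it swaps in a normalized Levenshtein similarity instead of Jaro-Winkler so it runs without natural, and the `afinn` sample, the helper names (`levenshtein`, `similarity`, `closestWord`) and the 0.6 threshold are all assumptions, not part of sentiment.

```javascript
// Hypothetical sample; the real AFINN list is much larger.
const afinn = { dumb: -3, good: 3, great: 3, bad: -3 };

// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means the strings are identical.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Return the list word most similar to the token, provided the
// similarity is at or above the threshold; otherwise return null.
function closestWord(token, list, threshold) {
  let best = null;
  let bestScore = threshold;
  for (const word of Object.keys(list)) {
    const score = similarity(token, word);
    if (score >= bestScore) {
      best = word;
      bestScore = score;
    }
  }
  return best;
}

console.log(closestWord('dumbbb', afinn, 0.6)); // matches 'dumb'
console.log(closestWord('cats', afinn, 0.6));   // no match, so null
```

With this measure, dumbbb and dumb are two deletions apart over six characters, a similarity of about 0.67. The threshold would need tuning against real data, since a low value starts mapping unrelated words onto list entries.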
Good question! Using edit distance for matching is a really interesting use case. I'm going to modify your title to make this a little more generic, but this is certainly something I'd be interested in supporting.
This is exactly what I'm seeing as well with casual comments and expressions on social media. +1 for this 👍
It looks very easy to add. Just allow for an optional callback that supports something like:
function middleware(text, value, wasNegated, afinn) {
if(value !== 0) return value; // I can easily modify affinity here
// do search on afinn here for closest word
return (afinn[closest] || 0) * (wasNegated ? -1 : 1); // Don't really write code like this
}
This would allow for a range of middleware that could, for example, be chained so that a different technique is applied whenever a simpler or faster one fails to find a match.
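A minimal sketch of that chaining idea, assuming the callback signature proposed above. The `chain` helper, the two example steps and the `afinn` sample are hypothetical; nothing here is an existing sentiment API.

```javascript
// Hypothetical sample; the real AFINN list is much larger.
const afinn = { dumb: -3, love: 3 };

// Combine several middleware functions into one. Each step is tried in
// order and the chain stops as soon as one yields a non-zero score.
function chain(...middlewares) {
  return function (text, value, wasNegated, list) {
    let result = value;
    for (const mw of middlewares) {
      if (result !== 0) break; // an earlier step already scored the token
      result = mw(text, result, wasNegated, list);
    }
    return result;
  };
}

// Look up a candidate spelling and apply negation if needed.
function lookup(candidate, wasNegated, list) {
  const score = list[candidate] || 0;
  if (score === 0) return 0;
  return wasNegated ? -score : score;
}

// Cheap step: collapse runs of three or more repeated characters,
// e.g. "dumbbb" becomes "dumb".
function collapseRepeats(text, value, wasNegated, list) {
  if (value !== 0) return value;
  return lookup(text.replace(/(.)\1{2,}/g, '$1'), wasNegated, list);
}

// Fallback step: strip a trailing "s", e.g. "loves" becomes "love".
function stripPlural(text, value, wasNegated, list) {
  if (value !== 0) return value;
  return lookup(text.replace(/s$/, ''), wasNegated, list);
}

const middleware = chain(collapseRepeats, stripPlural);
console.log(middleware('dumbbb', 0, false, afinn));
console.log(middleware('loves', 0, false, afinn));
```

Putting the cheapest steps first keeps the common case fast, and a slower fuzzy search could be appended as the last step of the chain.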
A different approach would be to filter the words through a spellchecker like https://github.com/atom/node-spellchecker. I don't know how this would fare in the benchmarks, but in my case it would be great, because it would let me apply other filters such as gender guessing and topic classification.
Cheers!
Hi, I took the liberty of forking this great repo to add some features I needed, and they are in line with what's described in this issue.
I added node-spellchecker to check for typos, and also "levenshtein" to find the spelling correction closest to the original word.
I also modified the "negation" feature to look backwards until a negation word or a new afinn word is found, to cover cases like "not too bad".
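For readers following along, the look-back rule described here could be sketched roughly as below. This is not the code from the fork: the `afinn` and `negators` samples and the function names are assumptions made for illustration.

```javascript
// Hypothetical samples; the real lists are much larger.
const afinn = { bad: -3, good: 3 };
const negators = new Set(['not', 'never', 'dont', 'cant']);

const isScored = (token) =>
  Object.prototype.hasOwnProperty.call(afinn, token);

// Walk backwards from the scored word at `index`. A negator flips the
// score; reaching another scored word first ends the search.
function isNegated(tokens, index) {
  for (let i = index - 1; i >= 0; i--) {
    if (negators.has(tokens[i])) return true;
    if (isScored(tokens[i])) return false;
  }
  return false;
}

// Sum the scores of all tokens, flipping the sign of negated ones.
function scoreTokens(tokens) {
  let total = 0;
  tokens.forEach((token, index) => {
    if (!isScored(token)) return;
    total += isNegated(tokens, index) ? -afinn[token] : afinn[token];
  });
  return total;
}

console.log(scoreTokens(['not', 'too', 'bad']));  // "bad" is flipped
console.log(scoreTokens(['good', 'but', 'bad'])); // no negation applies
```

One limitation of this sketch is that the look-back is unbounded, so a negator far earlier in the text still flips a later word; a real implementation would probably cap the distance or stop at sentence boundaries.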
Feel free to check the master branch on https://github.com/AmbitAI/sentiment
I'm happy to create a PR with part of the changes or the whole thing, depending on what's in line with the direction of the library.
@nsantini I think I would be interested in all three of these features, as long as they are added in a way that is optional (so as to preserve performance for those who need it). Curious to see how each of these features impacts the validation tests (make validate).
@thisandagain Is this issue still open? @nsantini Did you still want to create a PR with this feature? Let me know if it would be helpful for me to pitch in.
@pdw207 This is still open pending a PR. I'd be happy to review a PR from you if you want to pick up where @nsantini left off.
PR https://github.com/thisandagain/sentiment/pull/144
It has some merge conflicts. I'll try to solve them, but feel free to take over; I haven't looked into this for a while.
Solved the merge conflicts, but somebody more familiar with the changes that have happened since I forked needs to validate them :)
So, the sync_negation test case is failing. It looks like the logic that deals with sentence negation has changed since I forked, so my change is now double-negating the score. I'm not sure where to look for the new logic, though.