
tf-idf support

Open mef opened this issue 11 years ago • 2 comments

Hi,

How would you advise implementing tf-idf inside gramophone?

Good work, by the way.

mef avatar Oct 15 '13 15:10 mef

It's possible to process the documents with gramophone and then pass the results to natural's tf-idf function. See this gist: https://gist.github.com/bxjx/7001437

I've also added a { flatten: true } option to gramophone (version 0.0.3) that should make this easier.
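A rough sketch of the idea behind flattening, assuming it means each keyword is repeated once per occurrence so that a bag-of-words consumer such as a tf-idf index sees the right counts. This is not gramophone's code: flattenKeywords, the { term, tf } shape, and the sample data are made up for illustration.

```javascript
// Hypothetical helper (not part of gramophone): expand scored keywords
// of the assumed shape { term, tf } into a flat array, repeating each
// term tf times.
function flattenKeywords(scored) {
  const flat = [];
  for (const { term, tf } of scored) {
    for (let i = 0; i < tf; i++) {
      flat.push(term);
    }
  }
  return flat;
}

// Made-up sample data standing in for extracted keywords.
const scored = [
  { term: 'node', tf: 3 },
  { term: 'programming language', tf: 2 },
];

const flat = flattenKeywords(scored);
console.log(flat);
```

A flat array like this keeps multi-word keywords as single entries, which is what matters when the tf-idf step should count phrases rather than individual words.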

Let me know how you go! You may run into issues with stemming. If you do, post back and I can probably sort them out. The tf-idf code from natural is pretty straightforward, so I could either pull it into gramophone or submit a pull request to natural to make it friendlier to alternative tokenizers like gramophone.

Thanks for the props! This library is fairly niche and it's nice to know someone else might benefit.

bxjx avatar Oct 16 '13 02:10 bxjx

I have tested your example with natural 0.1.24 (tf-idf was broken in natural 0.1.23). It works fine.

However, the function tfidf.listTerms is broken:

tfidf.listTerms(0 /*document index*/).forEach(function(item, indx) {
    console.log(item.term + ': ' + item.tfidf);
});

This prints node programming language: NaN instead of the tf-idf measure.

To work around this, I had to change one line in natural's listTerms function so that it passes the term wrapped in an array: terms.push({term: term, tfidf: this.tfidf([term], d)}) instead of terms.push({term: term, tfidf: this.tfidf(term, d)})
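A toy model of why wrapping the term in an array could make a difference. It assumes that a tf-idf function splits a plain string into words but takes an array of terms as-is, and that idf has no guard for terms found in no document. None of this is natural's actual code; docs, tf, idf and tfidf below are hypothetical stand-ins.

```javascript
// Toy index: each document maps a whole term (possibly multi-word) to its count.
const docs = [
  { 'node programming language': 2, 'javascript': 1 },
  { 'javascript': 3 },
];

// Term frequency: 0 when the document does not contain the term.
function tf(term, doc) {
  return doc[term] || 0;
}

// Naive idf with no guard: a term found in no document gives
// Math.log(N / 0), which is Infinity.
function idf(term) {
  const containing = docs.filter((doc) => term in doc).length;
  return Math.log(docs.length / containing);
}

// Assumed behaviour: arrays are used as-is, strings are split into words.
function tfidf(terms, d) {
  const list = Array.isArray(terms) ? terms : terms.split(/\s+/);
  return list.reduce((sum, term) => sum + tf(term, docs[d]) * idf(term), 0);
}

const phrase = 'node programming language';

// Split into 'node', 'programming', 'language': each has tf 0 and
// idf Infinity, and 0 * Infinity is NaN.
const asString = tfidf(phrase, 0);

// Kept whole: tf is 2 and idf is Math.log(2 / 1).
const asArray = tfidf([phrase], 0);

console.log(asString, asArray);
```

Under these assumptions the string call yields NaN while the array call yields a finite score, which matches the symptom and the workaround above.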

I am not sure how the issue could be fixed cleanly.

mef avatar Nov 05 '13 13:11 mef