elasticsearch-index-termlist

Tf-idf

Open fsieduc opened this issue 9 years ago • 8 comments

Hi,

Could you add an option to return the tf-idf (or an equivalent) per term, as you have done for the frequency?

fsieduc, Mar 24 '15 08:03

Do you want TF/IDF for the index (shard), or for the document?

jprante, Mar 24 '15 09:03

My objective is to automate the construction of the Completion Suggester's documents (https://www.elastic.co/blog/you-complete-me) with the "best" terms found in an index/type.

Ideally, I need the tf-idf to be calculated over a subset of the documents in the index/type, not over all of them. For example: give me the terms and their tf-idf for the documents in index/type "yesterday/tweets" matching "foo". The corpus is then all the documents (tweets) in the index "yesterday" and the type "tweets" that match "foo".

In that case the tf does not change but the idf does.
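To illustrate the point, here is a minimal sketch of how restricting the corpus changes the idf while leaving the tf alone. It assumes the plain log(N / df) form of idf, which is only one common variant and not necessarily what Lucene or this plugin computes; the sample tweets and the function names are made up for the example.

```python
import math
from collections import Counter


def idf(term, corpus):
    """Inverse document frequency, plain log(N / df) variant (an assumption)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(doc, corpus):
    """Score every term of one tokenized document against the given corpus."""
    tf = Counter(doc)  # term frequency depends only on the document itself
    return {term: count * idf(term, corpus) for term, count in tf.items()}


# Hypothetical, already tokenized tweets.
tweets = [
    ["foo", "bar", "bar"],
    ["foo", "baz"],
    ["qux", "bar"],
    ["qux"],
]

# The restricted corpus: only the tweets matching "foo".
subset = [doc for doc in tweets if "foo" in doc]

doc = tweets[0]
full_scores = tf_idf(doc, tweets)
subset_scores = tf_idf(doc, subset)
```

With these sample tweets, "foo" appears in every document of the subset, so its idf (and therefore its tf-idf) drops to zero there, while its tf in the document is the same in both cases.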

So, in order of preference, I would like to have:

  • the list of terms with their tf-idf for a selection of documents found in an index/type (utopian)
  • the list of terms with their tf-idf for all documents in an index/type
  • the list of terms with their tf-idf for all documents in an index

Do you think this is possible?

Thanks

fsieduc, Mar 24 '15 10:03

Yes, this is possible.

The selection of documents found in an index/type is not utopian. I can walk through a search result with scan/scroll and retrieve it doc by doc. This may take an extreme amount of time (hours), and the result might only be available as file output, not over the REST API.
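As a rough sketch of that doc-by-doc walk, the loop below collects the counts a tf-idf calculation would need from a stream of hits. Everything in it is an assumption made for illustration: `collect_term_stats` and `fake_scroll` are hypothetical names, the hits only mimic the `_source` layout of a search hit, and lowercasing plus whitespace splitting stands in for the real analyzer. It is not how the plugin is implemented.

```python
from collections import Counter


def collect_term_stats(hits, field):
    """Walk the hits one at a time and gather term statistics.

    Returns (doc_count, total_freq, doc_freq), where total_freq counts every
    occurrence of a term and doc_freq counts the documents containing it.
    """
    doc_count = 0
    total_freq = Counter()
    doc_freq = Counter()
    for hit in hits:
        text = hit.get("_source", {}).get(field, "")
        tokens = text.lower().split()  # stand-in for the index analyzer
        doc_count += 1
        total_freq.update(tokens)
        doc_freq.update(set(tokens))  # each document counts once per term
    return doc_count, total_freq, doc_freq


def fake_scroll():
    """Hypothetical stand-in for the stream a scan/scroll client would yield."""
    yield {"_source": {"text": "foo bar bar"}}
    yield {"_source": {"text": "foo baz"}}


doc_count, total_freq, doc_freq = collect_term_stats(fake_scroll(), "text")
```

Because the hits are consumed as a stream, only the counters stay in memory, but every matching document still has to be fetched and tokenized, which is where the hours would go on a large result set.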

I am not sure how this can be useful for completion suggester FST construction. Synonyms, stopwords, phrases, and all the goodies of the Lucene suggesters are not available in term list construction.

jprante, Mar 24 '15 10:03

OK, that seems complex; my understanding of ES/Lucene is not sufficient.

Do you think the option "the list of terms with their tf-idf for all documents in an index" is possible via REST?

fsieduc, Mar 24 '15 11:03

Yes, I think so.

jprante, Mar 24 '15 11:03

Great news. Could you imagine adding this option to your project (like the &totalfreq=1 parameter)?

fsieduc, Mar 24 '15 13:03

Of course, please stay tuned.

jprante, Mar 24 '15 14:03

Woohoo!

fsieduc, Mar 24 '15 16:03