elasticsearch-index-termlist
Tf-idf
Hi,
Could you add the possibility of getting the "tf-idf" (or an equivalent) per term, as you have done for the frequency?
Do you want TF/IDF for the index (shard), or for the document?
My objective is to automate the construction of the Completion Suggester's documents (https://www.elastic.co/blog/you-complete-me) with the "best" terms found in an index/type.
Ideally I need the tf-idf to be calculated on a subset of the documents in the index/type, not on all of them. For example: give me the terms and their tf-idf for the documents in index/type "yesterday/tweets" matching "foo". The corpus is then all the documents (tweets) in the index "yesterday" and the type "tweets" that match "foo".
In that case the tf does not change but the idf does.
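To make the point about a changing idf concrete, here is a minimal sketch using the textbook formulation tf * log(N / df). The whitespace tokenizer, the sample tweets, and the function names are assumptions made for illustration; this is not the plugin's code, and Lucene's own scoring formula differs in its details.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a real analyzer does much more).
    return text.lower().split()

def idf_table(docs):
    # idf(term) = log(N / df), where df is the number of docs containing the term.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log(n / count) for term, count in df.items()}

def tf_idf(doc, idf):
    # tf is the raw count of the term inside this one document.
    tf = Counter(tokenize(doc))
    return {term: freq * idf[term] for term, freq in tf.items()}

# Made-up stand-ins for the documents in "yesterday/tweets".
tweets = ["foo bar baz", "foo bar", "bar qux", "qux quux"]

# The restricted corpus: only the tweets matching "foo".
subset = [t for t in tweets if "foo" in tokenize(t)]

idf_all = idf_table(tweets)
idf_subset = idf_table(subset)

# Same document, same tf, but different weights depending on the corpus.
weights_all = tf_idf(tweets[0], idf_all)
weights_subset = tf_idf(tweets[0], idf_subset)
```

In the restricted corpus every document contains "foo", so its idf drops to zero there, while a rarer term such as "baz" keeps a positive weight.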
So, in order of preference, I would like to have:
- the list of terms with their tf-idf for a selection of documents found in an index/type (probably utopian)
- the list of terms with their tf-idf for all documents in an index/type
- the list of terms with their tf-idf for all documents in an index
Do you think this is possible for you?
Thanks
Yes, this is possible.
The selection of documents found in an index/type is not utopian. I can walk through a search result with scan/scroll, then retrieve the documents one by one. This may take an extreme amount of time (hours), and might only be available as file output, not over the REST API.
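As a rough sketch of that doc-by-doc pass, the following accumulates the statistics in a single sweep and writes them out as tab-separated lines. `iter_matching_docs` is a hypothetical stand-in for a scan/scroll cursor (no Elasticsearch call is made here), and the tokenizer and output layout are assumptions as well.

```python
import io
import math
from collections import Counter

def iter_matching_docs():
    # Hypothetical stand-in for a scan/scroll cursor: yields one document at a time.
    yield "foo bar baz"
    yield "foo bar"

def accumulate(docs):
    # Single pass: count documents, document frequency, and total term frequency.
    n = 0
    df = Counter()
    total_tf = Counter()
    for text in docs:
        terms = text.lower().split()
        n += 1
        total_tf.update(terms)
        df.update(set(terms))
    return n, df, total_tf

def write_term_list(out, n, df, total_tf):
    # One line per term: term, total frequency, document frequency, idf.
    for term in sorted(df):
        idf = math.log(n / df[term])
        out.write(f"{term}\t{total_tf[term]}\t{df[term]}\t{idf:.4f}\n")

n, df, total_tf = accumulate(iter_matching_docs())
buffer = io.StringIO()  # an open file would work the same way
write_term_list(buffer, n, df, total_tf)
```

The idf can only be computed once the whole result set has been walked, because it depends on the final document count. That is one reason a file output fits this approach better than a quick REST response.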
I am not sure how this can be useful for completion suggester FST construction. Synonyms, stopwords, phrases, and all the goodies of the Lucene suggesters are not available in term list construction.
OK, that seems complex; my understanding of ES/Lucene is not sufficient.
Do you think that the option "the list of terms with their tf-idf for all documents in an index" is possible via REST?
Yes, I think so.
Great news. Could you imagine adding this option to your project (like the &totalfreq=1)?
Of course, please stay tuned.
Youhhouuu