elasticsearch-minhash

_score values for predicting similar news articles

Open sudo2012 opened this issue 8 years ago • 5 comments

Let's assume I have a set of news articles in my ES store. Is there a way to use the MinHash score value to check whether a new article matches any article already in ES? What I want to achieve is the following: suppose there are two articles on the same subject, one from MSNBC and the other from The Guardian. How can I tell from the score value that they cover the same subject?

sudo2012 avatar Sep 25 '15 11:09 sudo2012

Using the "copy_bits_to" parameter, you can copy the minhash value as a visible bit string (e.g. 01011...), and then I think similar documents can be found with a term or fuzzy query.

https://github.com/codelibs/elasticsearch-minhash/blob/master/src/test/java/org/codelibs/elasticsearch/minhash/MinHashPluginTest.java#L99
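To illustrate the idea of comparing bit strings, here is a minimal Python sketch that runs outside Elasticsearch. It assumes the bit strings have already been read back from the field named in "copy_bits_to", and that the fraction of matching bit positions is an acceptable stand-in for similarity. The function names and sample values are hypothetical and are not part of the plugin.

```python
def hamming_distance(bits_a: str, bits_b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bit strings must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


def bit_similarity(bits_a: str, bits_b: str) -> float:
    """Return the fraction of matching positions (1.0 means identical)."""
    if not bits_a:
        raise ValueError("bit strings must not be empty")
    return 1.0 - hamming_distance(bits_a, bits_b) / len(bits_a)


if __name__ == "__main__":
    # Made-up sample values; real ones would come from the bits field.
    stored = "0101101011001010"
    candidate = "0101101011001110"
    unrelated = "1010010100110101"
    print(bit_similarity(stored, candidate))
    print(bit_similarity(stored, unrelated))
```

A threshold on this value (which would need tuning on real articles) could then decide whether two articles are treated as covering the same subject.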

marevol avatar Sep 25 '15 22:09 marevol

Thank you for your answer. I'll try that ASAP and get back to you. Best

sudo2012 avatar Sep 26 '15 07:09 sudo2012

Is it correct to create the mapping like this:

$ curl -XPUT "localhost:9200/my_index/my_type/_mapping" -d '{
  "my_type": {
    "properties": {
      "message": {
        "type": "string",
        "copy_to": "minhash_value"
      },
      "minhash_value": {
        "type": "minhash",
        "minhash_analyzer": "minhash_analyzer",
        "copy_bits_to": "minhash_bits"
      },
      "minhash_bits": {
        "type": "string"
      }
    }
  }
}'

sudo2012 avatar Sep 26 '15 08:09 sudo2012

If you want to make the "minhash_bits" value visible, it's better to add "store": true.
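As a rough sketch of that suggestion, the Python snippet below builds the mapping body from the question with "store" set on the "minhash_bits" field. Placing the flag on that field is an assumption about what is meant here, and the helper name build_mapping is hypothetical. The snippet only prints the JSON and does not contact Elasticsearch.

```python
import json


def build_mapping(store_bits: bool = True) -> dict:
    """Return the mapping body from the question, with "store" on the bits field."""
    return {
        "my_type": {
            "properties": {
                "message": {"type": "string", "copy_to": "minhash_value"},
                "minhash_value": {
                    "type": "minhash",
                    "minhash_analyzer": "minhash_analyzer",
                    "copy_bits_to": "minhash_bits",
                },
                # Assumption: "store" belongs on the field receiving the bits.
                "minhash_bits": {"type": "string", "store": store_bits},
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON could be passed as the -d payload of the curl command.
    print(json.dumps(build_mapping(), indent=2))
```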

marevol avatar Sep 27 '15 05:09 marevol

Hello, I tried the copy_bits_to parameter, but the created field does not show up in the results:

PUT /test_minhash_test
{
  "index": {
    "analysis": {
      "analyzer": {
        "minhash_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["minhash"]
        }
      }
    }
  }
}

PUT /test_minhash_test/_doc/_mapping
{
  "_doc": {
    "properties": {
      "message": {
        "type": "text",
        "copy_to": "minhash_value"
      },
      "minhash_value": {
        "type": "minhash",
        "minhash_analyzer": "minhash_analyzer",
        "store": true,
        "copy_bits_to": "content_minhash_bits"
      },
      "content_minhash_bits": {
        "type": "keyword",
        "store": true
      }
    }
  }
}

GET /test_minhash_test/_doc/_search/?pretty&stored_fields=*

Fred12 avatar Mar 18 '19 08:03 Fred12