elasticsearch-minhash
_score values for predicting similar news articles
Let's assume I have a set of news articles in my ES store. Is there a way to use the MinHash score value to check whether a new article matches any article already in ES? What I want to achieve is the following: suppose there are two articles on the same subject, one from MSNBC and the other from TheGuardian. How can I tell from the score value that they cover the same subject?
Using the "copy_bits_to" parameter, you can copy the minhash value into another field as a visible bit string (e.g. 01011...). I think similar documents can then be found with a term or fuzzy query on that field.
https://github.com/codelibs/elasticsearch-minhash/blob/master/src/test/java/org/codelibs/elasticsearch/minhash/MinHashPluginTest.java#L99
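Once the bit strings are visible, one way to judge how close two documents are is to compare the strings position by position on the client side. The following is only a minimal sketch under assumptions: it presumes the bit strings have already been fetched from ES into Python as equal-length strings of 0s and 1s, and the function names and sample values are hypothetical, not part of the plugin. If each position behaves like an independent hash bit, unrelated documents would be expected to agree on roughly half of the positions, so only values well above 0.5 would hint at similar content.

```python
def matching_bit_fraction(bits_a: str, bits_b: str) -> float:
    """Return the fraction of positions where two equal-length bit strings agree."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bit strings must have the same length")
    if not bits_a:
        raise ValueError("bit strings must not be empty")
    matches = sum(1 for a, b in zip(bits_a, bits_b) if a == b)
    return matches / len(bits_a)


def rank_candidates(query_bits: str, candidates: dict) -> list:
    """Sort (doc_id, agreement) pairs so the closest bit string comes first.

    `candidates` maps a document id to its bit string (hypothetical layout).
    """
    scored = [
        (doc_id, matching_bit_fraction(query_bits, bits))
        for doc_id, bits in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up 8-bit strings for illustration; real values would be longer.
    new_article = "01011010"
    stored = {"msnbc-1": "01011011", "guardian-7": "10100101"}
    for doc_id, agreement in rank_candidates(new_article, stored):
        print(doc_id, agreement)
```

A comparison like this works on any pair of bit strings, whereas a term query only matches identical values, so it may be useful as a second step after a broader query has produced candidate documents.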
Thank you for your answer. I'll try that as soon as possible and get back to you. Best
Is this the correct way to create the mapping?
$ curl -XPUT "localhost:9200/my_index/my_type/_mapping" -d '{
  "my_type": {
    "properties": {
      "message": {
        "type": "string",
        "copy_to": "minhash_value"
      },
      "minhash_value": {
        "type": "minhash",
        "minhash_analyzer": "minhash_analyzer",
        "copy_bits_to": "minhash_bits"
      },
      "minhash_bits": {
        "type": "string"
      }
    }
  }
}'
If you want the "minhash_bits" value to be visible, it's better to add "store": true to that field.
Hello, I tried the copy_bits_to parameter, but the created field does not show up in the results:

PUT /test_minhash_test
{
  "index": {
    "analysis": {
      "analyzer": {
        "minhash_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["minhash"]
        }
      }
    }
  }
}
PUT /test_minhash_test/_doc/_mapping
{
  "_doc": {
    "properties": {
      "message": {
        "type": "text",
        "copy_to": "minhash_value"
      },
      "minhash_value": {
        "type": "minhash",
        "minhash_analyzer": "minhash_analyzer",
        "store": true,
        "copy_bits_to": "content_minhash_bits"
      },
      "content_minhash_bits": {
        "type": "keyword",
        "store": true
      }
    }
  }
}
GET /test_minhash_test/_doc/_search/?pretty&stored_fields=*