elasticsearch-net
NEST cosineSimilarity script returns bad results
NEST/Elasticsearch.Net version: 7.17.4
Elasticsearch version: 8.3.3
.NET runtime version: .NET 6.0
Operating system version: Windows 11
Indexing a dense vector field via NEST leads to bad query results. When I create the index via NEST (see code below), my Docker volume takes about 80 MB. The same procedure with the Python API takes about 170 MB. With the Python Elasticsearch client, cosineSimilarity returns correct results, whereas in .NET the results are poor and not even close to the query.
Current Query and Result
NEST: Query: "verbandwechsel" Current Output with highest Score: Osteotomie, intratrochanter/subtrochanter, als alleinige Leistung
Python: Query: "verbandwechsel" Current output with highest Score: verbandwechsel durch den facharzt nach operativem wundverschluss bestandteil von allgemeine grundleistungen
Code C#:
Index Creation Code:
.AutoMap()
.Properties(ps => ps
    .Keyword(k => k.Name(n => n.Ident))
    .Text(t => t.Name(n => n.Bezeichnung))
    .DenseVector(t => t.Name(n => n.BezeichnungProcessedVektor).Dimensions(768))
)
Property in my DTO:
public float[] BezeichnungProcessedVektor { get; set; }
Query Code NEST:
q.ScriptScore(s => s
    .Name("bert_search")
    .Script(sn => sn
        .Source("cosineSimilarity(params.queryVec, doc['BezeichnungProcessedVektor']) + 1.0")
        .Params(new Dictionary<string, object> { { "queryVec", vector } })
    )
)
Code Python:
Index.json for the Document Property (vector)
"text_vector": {
    "type": "dense_vector",
    "dims": 768
},
Inserting Data into Index
from elasticsearch.helpers import bulk

doc_json = []
for doc in documents:
    # One bulk action per document; named "action" to avoid shadowing the json module
    action = {
        '_op_type': 'index',
        '_index': 'tarmed',
        'ident': doc.ident,
        'text': doc.text,  # equal to Bezeichnung in C#
        'text_vector': bert_model.encode(doc.text),  # equal to BezeichnungProcessedVektor in C#
    }
    doc_json.append(action)
bulk(client, doc_json)
Query code (Python):
response = client.search(
    size=100,
    index="tarmed",
    query={"script_score": {
        "query": {"match_all": {}},
        "script": {
            # 'text_vector' is the field name defined in the Python mapping above
            "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
            "params": {"query_vector": question_embedding}
        }
    }}
)
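Both queries score with cosineSimilarity(...) + 1.0, so the score of any single hit can be recomputed offline from the query vector and the vector stored in that hit's `_source`. The following is only a minimal sketch in plain Python; the helper names `cosine_similarity` and `expected_score` are made up for illustration and are not part of either client.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def expected_score(query_vec: Sequence[float], doc_vec: Sequence[float]) -> float:
    # Same formula as the script source: cosineSimilarity(...) + 1.0
    return cosine_similarity(query_vec, doc_vec) + 1.0
```

If the recomputed value for the top NEST hit differs clearly from the `_score` Elasticsearch reports, the stored vector or the query vector was probably not what the script actually received.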
Both programs are based on the exact same BERT model and the exact same data file.
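With model and data being identical, what remains to compare is what each client actually put into the index. As a rough sketch, the helper below walks a mapping response and lists every dense_vector field with its dimensions. It assumes the usual shape of a `GET /<index>/_mapping` response (index name, then `mappings`, then `properties`); the function name `dense_vector_fields` is hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple


def dense_vector_fields(
    mapping_response: Dict[str, Any],
) -> Iterator[Tuple[str, str, Any]]:
    """Yield (index_name, field_path, dims) for every dense_vector field."""

    def walk(props: Dict[str, Any], prefix: str):
        for name, spec in props.items():
            path = f"{prefix}{name}"
            if spec.get("type") == "dense_vector":
                yield path, spec.get("dims")
            # object / nested fields carry their own "properties"
            if "properties" in spec:
                yield from walk(spec["properties"], path + ".")

    for index_name, body in mapping_response.items():
        props = body.get("mappings", {}).get("properties", {})
        for path, dims in walk(props, ""):
            yield index_name, path, dims
```

Running this on the mapping of the NEST-created index and of the Python-created index would show whether the vector field has the same name, type and dimension count in both. The field name matters because the script source refers to it as a literal string.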
Btw., according to the Elastic docs, a dense_vector field should support more parameters than are available in NEST, for example "similarity": "dot_product".
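For reference, this is roughly what such a mapping looks like when built as a plain dict for the Python client. It is a sketch based on the 8.x dense_vector documentation, where `similarity` goes together with `index: true` and accepts `l2_norm`, `dot_product` or `cosine`. The helper name `dense_vector_mapping` is made up for illustration.

```python
def dense_vector_mapping(dims: int, similarity: str = "dot_product") -> dict:
    """Build a dense_vector field mapping with an explicit similarity."""
    # Values listed in the Elasticsearch 8.x docs (assumption: 8.3-era set)
    allowed = {"l2_norm", "dot_product", "cosine"}
    if similarity not in allowed:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    if dims <= 0:
        raise ValueError("dims must be positive")
    return {
        "type": "dense_vector",
        "dims": dims,
        "index": True,  # similarity only applies to indexed vectors
        "similarity": similarity,
    }
```

Note that, per the same docs, `dot_product` expects all vectors to be normalized to unit length, so `cosine` may be the safer choice for raw BERT embeddings.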