NEST cosSimilarity Script returns bad results

Open LordSaitamaa opened this issue 1 year ago • 0 comments

NEST/Elasticsearch.Net version: 7.17.4
Elasticsearch version: 8.3.3
.NET runtime version: .NET 6.0
Operating system version: Windows 11

Indexing a dense vector field via NEST leads to bad query results. My Docker volume takes about 80 MB when I create the index via NEST (see code below). The same procedure with the Python API takes about 170 MB. The Python test with the Python Elasticsearch client returns correct results with cosineSimilarity, whereas in .NET the results are poor and not even close to the query.

Current Query and Result

NEST:
Query: "verbandwechsel"
Current output with highest score: Osteotomie, intratrochanter/subtrochanter, als alleinige Leistung

Python:
Query: "verbandwechsel"
Current output with highest score: verbandwechsel durch den facharzt nach operativem wundverschluss bestandteil von allgemeine grundleistungen

Code C#:

Index Creation Code:

 .AutoMap()
    .Properties(ps => ps
        .Keyword(k => k.Name(n => n.Ident))
        .Text(t => t.Name(n => n.Bezeichnung))
        .DenseVector(t => t.Name(n => n.BezeichnungProcessedVektor).Dimensions(768))

Property in my DTO:

        public float[] BezeichnungProcessedVektor { get; set; }

Query Code NEST:

q.ScriptScore(s => s
    .Name("bert_search")
    .Script(sn => sn
        .Source("cosineSimilarity(params.queryVec, doc['BezeichnungProcessedVektor']) + 1.0")
        .Params(new Dictionary<string, object> { { "queryVec", vector } })
    )
)

Code Python:

Index.json for the Document Property (vector)

            "text_vector": {
                "type": "dense_vector",
                "dims":768
            },

Inserting Data into Index

from elasticsearch.helpers import bulk

# Build one bulk "index" action per document.
doc_json = []
for doc in documents:
    action = {
        '_op_type': 'index',
        '_index': 'tarmed',
        'ident': doc.ident,
        'text': doc.text,  # equal to Bezeichnung in C#
        'text_vector': bert_model.encode(doc.text),  # equal to BezeichnungProcessedVektor in C#
    }
    doc_json.append(action)

bulk(client, doc_json)

Query code (Python):

response = client.search(
    size=100,
    index="tarmed",
    query={"script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'BezeichnungProcessedVektor') + 1.0",
            "params": {"query_vector": question_embedding}
        }
    }}
)
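To check whether the scores returned by either client are plausible, the value the script computes can be reproduced locally. The following is a minimal sketch in plain Python, not part of either program above; the helper names `cosine_similarity` and `expected_script_score` are made up for illustration, and it assumes the query vector and the stored document vector are available as plain lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def expected_script_score(query_vec, doc_vec):
    """Mirror of the script source: cosineSimilarity(...) + 1.0."""
    return cosine_similarity(query_vec, doc_vec) + 1.0
```

Comparing `expected_script_score(question_embedding, stored_vector)` with the `_score` of the same hit in both the NEST and the Python index would show whether the stored vectors or the query vector differ between the two setups.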

Both programs are based on the exact same BERT model and the exact same data file.

By the way, according to the Elastic docs, a dense_vector field should accept more parameters than NEST exposes, for example "similarity": "dot_product".
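For reference, this is roughly what such a mapping looks like when built by hand, sketched in Python. It assumes Elasticsearch 8.x, where `similarity` is only accepted together with `"index": true`, and where `dot_product` expects unit-length vectors. The helper names `dense_vector_mapping` and `to_unit_length` are hypothetical, and the field name `text_vector` is taken from the Python mapping above.

```python
import math


def dense_vector_mapping(dims, similarity="dot_product"):
    """Build a dense_vector field mapping with an explicit similarity.

    Assumption: Elasticsearch 8.x, where `similarity` requires `index: true`.
    """
    return {
        "type": "dense_vector",
        "dims": dims,
        "index": True,
        "similarity": similarity,
    }


def to_unit_length(vec):
    """Scale a vector to length 1, as dot_product similarity expects."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


# Mapping body for an index with a single 768-dimensional vector field.
mapping = {"properties": {"text_vector": dense_vector_mapping(768)}}
```

Since NEST 7.17.4 does not seem to expose these options on `.DenseVector(...)`, a body like this would presumably have to be sent some other way, for example as raw JSON through the low-level client.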

LordSaitamaa, Sep 08 '22 09:09