Double sorting with aggregation not working

Open KlimTodrik opened this issue 1 year ago • 2 comments

Describe the bug

When we run an aggregation query sorted by two fields, we see two bugs:

  • Missing sorting
  • Different results for each call

Steps to reproduce (if applicable)

  1. Download dataset and index config from https://dev2.manticoresearch.com/index-settings-and-data.zip
  2. Run Quickwit in Docker quickwit/quickwit:0.8.1
  3. Create index (config provided in attached archive):
export HOST='http://localhost:7280'

curl -s -XPOST "${HOST}/api/v1/indexes" \
    --header "content-type: application/yaml" \
    --data-binary @./index-config.yaml
  4. Upload data (the dataset is pretty big, so we split it into chunks):
split -l 10000 ./data.jsonl ./data_splitted.

echo "Starting loading"
for f in ./data_splitted.*; do
    echo "Upload chunk $f"
    curl -s -XPOST "${HOST}/api/v1/hn_small/ingest?commit=force" --data-binary @$f
    rm $f
done
echo "Finished"
  5. Perform query:
curl --location "${HOST}/api/v1/hn_small/search" \
--header 'Content-Type: application/json' \
--data '{"query":"*","max_hits":0,"aggs":{"comment_ranking_avg":{"terms":{"field":"comment_ranking","size":20,"order":{"avg_field":"desc","_key":"desc"}},"aggs":{"avg_field":{"avg":{"field":"author_comment_count"}}}}}}'
  6. We get results with the wrong sorting:
{
    "num_hits": 1165439,
    "hits": [],
    "elapsed_time_micros": 6665,
    "errors": [],
    "aggregations": {
        "comment_ranking_avg": {
            "buckets": [
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 928.0 # Should be 2nd
                },
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 961.0 # Should be 1st
                },
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 730.0
                },
.....                
  7. If you repeat the request several times, it can return different results for the same query:
{
    "num_hits": 1165439,
    "hits": [],
    "elapsed_time_micros": 9610,
    "errors": [],
    "aggregations": {
        "comment_ranking_avg": {
            "buckets": [
                {
                    "avg_field": {
                        "value": 64.0
                    },
                    "doc_count": 1,
                    "key": 1305.0
                },
                {
                    "avg_field": {
                        "value": 117.0
                    },
                    "doc_count": 1,
                    "key": 1296.0
                },
                {
                    "avg_field": {
                        "value": 40.0
                    },
                    "doc_count": 1,
                    "key": 1289.0
                },
                {
                    "avg_field": {
                        "value": 87.0
                    },
                    "doc_count": 1,
                    "key": 1287.0
                },
......

PS: Sometimes it returns results without grouping. In that case you should reindex your dataset

"buckets": [
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 961.0
                },
                {
                    "avg_field": {
                        "value": 3080.0
                    },
                    "doc_count": 1,
                    "key": 980.0
                },
                {
                    "avg_field": {
                        "value": 3077.0
                    },
                    "doc_count": 1,
                    "key": 1176.0
                },

So generally we can get 3 different results for one query.
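
To make the non-determinism easier to demonstrate, the responses from repeated calls can be grouped by the order of their bucket keys. This is a minimal sketch in Python, assuming each response has already been parsed into a dict with the shape shown above. The names `count_distinct_orderings`, `make_response` and `agg_name` are hypothetical and not part of Quickwit.

```python
from collections import Counter
from typing import Any


def count_distinct_orderings(
    responses: list[dict[str, Any]], agg_name: str
) -> Counter:
    """Count how often each sequence of bucket keys appears across responses."""
    orderings: Counter = Counter()
    for response in responses:
        buckets = response["aggregations"][agg_name]["buckets"]
        orderings[tuple(b["key"] for b in buckets)] += 1
    return orderings


def make_response(keys: list[float]) -> dict[str, Any]:
    """Build a stripped-down response holding only bucket keys (illustrative)."""
    buckets = [{"key": k} for k in keys]
    return {"aggregations": {"comment_ranking_avg": {"buckets": buckets}}}


# Key sequences taken from the three truncated outputs in this issue.
responses = [
    make_response([928.0, 961.0, 730.0]),
    make_response([1305.0, 1296.0, 1289.0, 1287.0]),
    make_response([961.0, 980.0, 1176.0]),
]
orderings = count_distinct_orderings(responses, "comment_ranking_avg")
print(len(orderings))  # number of distinct orderings observed
```

A count greater than 1 for the same query means the ordering is not stable between calls.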

PS: Elasticsearch compatible URL has the same behaviour

Expected behavior

It should return the results shown below:

{
    "num_hits": 1165439,
    "hits": [],
    "elapsed_time_micros": 6665,
    "errors": [],
    "aggregations": {
        "comment_ranking_avg": {
            "buckets": [
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 961.0
                },
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key":  928.0
                },
                {
                    "avg_field": {
                        "value": 3504.0
                    },
                    "doc_count": 1,
                    "key": 730.0
                },
..... 

Configuration: Please provide:

  1. Output of quickwit --version
Quickwit 0.8.1 (aarch64-unknown-linux-gnu 2024-03-29T14:09:41Z e6c5396)
  2. The index_config.yaml
(Provided in the attached archive)

KlimTodrik, Jun 13 '24 09:06

{
  "query": "*",
  "max_hits": 0,
  "aggs": {
    "comment_ranking_avg": {
      "terms": {
        "field": "comment_ranking",
        "size": 20,
        "order": {
          "avg_field": "desc",
          "_key": "desc"
        }
      },
      "aggs": {
        "avg_field": {
          "avg": {
            "field": "author_comment_count"
          }
        }
      }
    }
  }
}

This is not a correct way to define the order. It should be:

"order": [ { "avg_field": "desc" }, { "_key":"desc" } ] 

However, this is not currently supported; only sorting by a single field is supported.
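
Until multi-field ordering is supported, one possible workaround is to request a single-field order and break ties on the client. This is a minimal sketch in Python, assuming the response shape shown earlier in this issue (each bucket has `key` and `avg_field.value`); the function name `sort_buckets` is hypothetical. It only reorders the buckets the server returned, so ties that fall at the `size` cutoff may still be selected arbitrarily on the server side.

```python
from typing import Any


def sort_buckets(buckets: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Order buckets by avg_field.value descending, then by key descending."""
    return sorted(
        buckets,
        key=lambda b: (b["avg_field"]["value"], b["key"]),
        reverse=True,
    )


# Buckets copied from the first response in this issue.
buckets = [
    {"avg_field": {"value": 3504.0}, "doc_count": 1, "key": 928.0},
    {"avg_field": {"value": 3504.0}, "doc_count": 1, "key": 961.0},
    {"avg_field": {"value": 3504.0}, "doc_count": 1, "key": 730.0},
]
ordered_keys = [b["key"] for b in sort_buckets(buckets)]
print(ordered_keys)  # [961.0, 928.0, 730.0]
```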

PSeitz, Jun 13 '24 10:06

The order you suggested does not work either, since it is not implemented yet:

curl --location 'http://127.0.0.1:7280/api/v1/hn_small/search' \
--header 'Content-Type: application/json' \
--data '{
    "query": "*",
    "max_hits": 0,
    "aggs": {
        "comment_ranking_avg": {
            "terms": {
                "field": "comment_ranking",
                "size": 20,
                "order": [
                    {
                        "avg_field": "desc"
                    },
                    {
                        "_key": "desc"
                    }
                ]
            },
            "aggs": {
                "avg_field": {
                    "avg": {
                        "field": "author_comment_count"
                    }
                }
            }
        }
    }
}'
{
    "message": "invalid aggregation request: invalid type: sequence, expected a map at line 1 column 180"
}

With an order by a single key, it works fine and returns the same results on every call.

You should probably note somewhere in the docs that only one sort argument is currently supported.
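
Until this is documented, a client could guard against the unsupported forms before sending a request. The sketch below is in Python and relies only on what this thread reports for Quickwit 0.8.1: a list-valued `order` is rejected with an error, and a map with several keys is accepted but produces unsorted, unstable results. The helper name `check_terms_order` and its return labels are hypothetical.

```python
from typing import Any


def check_terms_order(order: Any) -> str:
    """Classify a terms `order` clause using the behavior reported in this thread."""
    if isinstance(order, list):
        # Reported to fail with "invalid type: sequence, expected a map".
        return "rejected"
    if isinstance(order, dict) and len(order) == 1:
        # Single-field ordering is the only supported form.
        return "ok"
    if isinstance(order, dict) and len(order) > 1:
        # Accepted by the server, but results were unsorted and unstable.
        return "unsupported"
    return "invalid"


print(check_terms_order({"avg_field": "desc"}))
print(check_terms_order({"avg_field": "desc", "_key": "desc"}))
print(check_terms_order([{"avg_field": "desc"}, {"_key": "desc"}]))
```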

KlimTodrik, Jun 13 '24 11:06

@PSeitz will the issue be closed with the merged PR https://github.com/quickwit-oss/quickwit/pull/5121 ?

fmassot, Jul 14 '24 21:07

There's also https://github.com/quickwit-oss/tantivy/pull/2451

But it's just covering error handling, not implementing order by multiple fields

PSeitz, Jul 15 '24 00:07