eland filter only works for the first couple of records

When accessing an elastic search index using ELAND via an index query like:

el_df = ed.DataFrame(es_client=es, es_index_pattern='my_index-2021.06.21')
el_df[el_df['key'] == 'value']

It retruns 0 matching data records, however:

el_df['SysGroup'].to_pandas().value_counts()

shows many results.

How can I get the ELAND dataframe to apply the filter /value_counts operation not only on the first 500 rows but the full index?

Jun 21 '21 08:06 geoHeil

I'm not sure if I'm misunderstanding your question, are the two code snippets related somehow or are they independent?

To get filter operations to "stick" you have to store the result somewhere (ie eland.DataFrames are immutable)

el_def = el_df[el_df['key'] == 'value']

What sort of output are you receiving for Eland's value_counts() versus pandas? Could you post an example (at least of the metadata, you can scrub the data itself if necessary)

Jun 21 '21 19:06 sethmlarson

Well I do not need it to stick rather I observe that a materilization (to_pandas) returns the expected result in value_counts (all categories are present) whereas directly calling value_counts on the ELAND dataframe is only performing the value counts on the first 500 (or so) records and not the full dataframe. How can I tell ELAND to perform this operation on the full dataframe?

Jun 22 '21 07:06 geoHeil

@geoHeil I wonder if you're running into an issue with bucketing aggregations not being "scrollable", ie at some point Elasticsearch aggregations with high cardinality of documents won't return you further results. However the composite aggregation (which works with the terms agg, how eland.Series.value_counts() is implemented) allows for "pagination", perhaps there's an improvement to be made to Eland here?

Do you see the same results as is given in Eland as you see with this ES query?

POST https://<cluster>/<index-pattern>/_search?size=0
{"aggs":{"value_counts":{"terms":{"field":"SysGroup","size":1000}}},"query":{"exists":{"field":"SysGroup"}}}

Jun 22 '21 19:06 sethmlarson

Let me check. It certainly is a high cardinality thing. And could well be pagination.

Jun 23 '21 09:06 geoHeil

Did you mean:

curl -u user:password -d '{"aggs":{"value_counts":{"terms":{"field":"SysGroup","size":1000}}},"query":{"exists":{"field":"SysGroup"}}}' -H "Content-Type: application/json" -X POST https://host/index/_search?size=0

if yes, this fails with:

{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Unexpected character (''' (code 39)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')\n at [Source: (org.elasticsearch.common.io.stream.InputStreamStreamInput); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Unexpected character (''' (code 39)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')\n at [Source: (org.elasticsearch.common.io.stream.InputStreamStreamInput); line: 1, column: 2]"},"status":400}

Jun 23 '21 09:06 geoHeil

You can make the request with the Python client like so:

from elasticsearch import Elasticsearch

es = Elasticsearch("https://<host>:<port>")
es.search(
    index="<index>",
    body={
        "aggs": {"value_counts": {"terms": {"field": "SysGroup", "size": 1000}}},
        "query": {"exists": {"field": "SysGroup"}},
    },
)

Maybe something with that terminal command is encoding the JSON weird?

Jun 23 '21 17:06 sethmlarson

That fails with:

RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [SysGroup] in order to load field data by uninverting the inverted index. Note that this can use significant memory.')

according to https://stackoverflow.com/questions/62043037/elasticsearch-want-to-sort-by-field-in-all-indices-where-that-particular-field the field mapping needs to be changed. I do not have ther permissions to do so but will need to request this by some technician. Is there another way to query this information directly meanwhile?

Jun 23 '21 17:06 geoHeil

Is there by chance a SysGroup.keyword field? (Or another subfield mapped as keyword)

es.search(
    index="<index>",
    body={
        "aggs": {"value_counts": {"terms": {"field": "SysGroup.keyword", "size": 1000}}},
        "query": {"exists": {"field": "SysGroup"}},
    },
)

Jun 23 '21 18:06 sethmlarson

Well there is:

"FOO_TIMESTAMP": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },

some timestamp column apparently mapped a) as text and b) as keyword. However these are not sub fields. When querying that one some computation is started successfully.

The result of this query is a timeout - it does not complete within 500 seconds. However, the nature of this field (= timestamp) does not really seem to be useful for this particular query.

Jun 23 '21 19:06 geoHeil

this seems to work now with the keyword field.

es_res = es.search(
    index="<<index>>",
    body={
        "aggs": {"value_counts": {"terms": {"field": "SysGroup.keyword", "size": 1000}}},
        "query": {"exists": {"field": "SysGroup"}},
    },
)

pd.DataFrame(es_res['aggregations']['value_counts']['buckets'])

however, as written before:

el_df = ed.DataFrame(es_client=es, es_index_pattern='<<index>>')
el_df[el_df['SysGroup'] == '<<value>>']

whereas the manual es.search query return more than 1500 documents for that particular key.

Jul 01 '21 07:07 geoHeil

filter only works for the first couple of records - not the full index