
[BUG] opensearch-with-long-numerals runs into timeout

Open rlueckl opened this issue 11 months ago • 5 comments

Describe the bug

Don't really know how to describe this. OpenSearch Dashboards 2.12.0 fails to fetch data resulting in a timeout, truncated response and broken JSON where OpenSearch Dashboards 2.11.0 works perfectly fine.

To Reproduce

Don't know. Tried to compare 2.11.0 with 2.12.0. The only difference I found is that 2.12.0 calls POST /internal/search/opensearch-with-long-numerals whereas 2.11.0 calls POST /internal/search/opensearch for the exact same query. So there might be a problem with the "long-numerals" part.

The query is a simple 15 second time window on one of our indices. 2.11.0 gives back 397 hits with a response size of 1,02MB within 260ms according to the developer console. 2.12.0 runs into a timeout (120sec) then throws the following error:

JSON.parse: expected ',' or '}' after property value in object at line 1 column 306251 of the JSON data

HttpFetchError@https://host03.server.lan/7326/bundles/core/core.entry.js:15:184257
fetchResponse@https://host03.server.lan/7326/bundles/core/core.entry.js:15:191557

The response size is also 1,02MB (after all it's the same query).
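The column number in the JSON.parse error can be used to find the offending part of the response. Below is a minimal sketch, assuming the raw response body has been saved as a single-line string (for example from the browser's network tab) and that the reported column is 1-based. The helper name `contextAt` and the `>>>` marker are made up for this illustration and are not part of OpenSearch Dashboards.

```typescript
// Hypothetical helper: show the text around a 1-based column of a
// single-line JSON payload, with ">>>" marking the reported position.
function contextAt(raw: string, column: number, radius: number = 40): string {
  // Convert the 1-based column to a 0-based index, clamped to the payload.
  const index = Math.min(raw.length, Math.max(0, column - 1));
  const start = Math.max(0, index - radius);
  const end = Math.min(raw.length, index + radius);
  return raw.slice(start, index) + ">>>" + raw.slice(index, end);
}

// Usage with a stand-in payload; the real one would be the saved response body.
const sample = '{"message":"ranges [(9206423891869844203,-9207833944162114199]"}';
console.log(contextAt(sample, 22, 12));
```

With the real payload, `contextAt(body, 306251)` would print the characters on either side of the position named in the error message.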

No errors visible in the log (journalctl -u opensearch-dashboards.service) of OpenSearch Dashboards.

Expected behavior

Dashboards 2.12.0 works the same as 2.11.0

OpenSearch Version 2.12.0 (Debian Package installed from artifacts.opensearch.org/releases/bundle/opensearch/2.x/apt)

Dashboards Version 2.12.0 (Debian Package installed from artifacts.opensearch.org/releases/bundle/opensearch-dashboards/2.x/apt)

Plugins

# OPENSEARCH_JAVA_HOME=/usr/share/opensearch/jdk /usr/share/opensearch/bin/opensearch-plugin list 
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-skills
opensearch-sql
prometheus-exporter

(Prometheus Exporter Plugin from: Aiven-Open/prometheus-exporter-plugin-for-opensearch)

Screenshots

Comparing request & response headers with Meld: compare_request_response_headers

Exact same query with the same request and response sizes results in different runtimes and error on 2.12.0 (/internal/search/opensearch-with-long-numerals) vs. 2.11.0 (/internal/search/opensearch)

2.11.0 works as expected: dashboards_2 11 0

2.12.0 times out and throws an error: dashboards_2 12 0

Host/Environment (please complete the following information):

  • Server OS: Debian 12 Bookworm
  • Client OS: Linux Mint 21.3 Virginia
  • Firefox 123.0

rlueckl avatar Mar 01 '24 08:03 rlueckl

I could narrow it down to one specific log from a Cassandra system.log which apparently causes the timeout/JSON Parse error in 2.12.0. Two examples attached:

cassandra_example1.log cassandra_example2.log

The "message" and "logmessage" fields are quite long, but it's a normal output for Cassandra and causes no issues in Dashboards 2.11.0

Looking at the examples the error apparently happens when Dashboards parses the message field:

cassandra_example1.log:
Completing uncommitted paxos instances for ****** on ranges [(9206423891869844203,-9207833944162114199], 
                                                                                  ^ this is where the syntax error happens

cassandra_example2.log:
Completed 0 uncommitted paxos instances for ****** on ranges [(9206423891869844203,-9207833944162114199],
                                                                                   ^ this is where the syntax error happens

So it looks like the error is happening within a string (the "message" field). Why does Dashboards try to parse this string as JSON?
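For context on why "long numerals" need special handling at all: both numerals in the log lines quoted above lie outside the range JavaScript numbers can represent exactly. The sketch below only demonstrates that general fact with plain `JSON.parse`. It does not reproduce the parser used by the opensearch-with-long-numerals route, and the helper name `survivesRoundTrip` is made up for this illustration.

```typescript
// The two numerals from the Cassandra log lines quoted above.
const numerals: string[] = ["9206423891869844203", "-9207833944162114199"];

// Hypothetical helper: true when a numeral keeps all of its digits after
// being parsed into a JavaScript number and printed again.
function survivesRoundTrip(numeral: string): boolean {
  return String(JSON.parse(numeral)) === numeral;
}

// As a bare JSON number, the value is rounded to the nearest double.
for (const n of numerals) {
  console.log(n, Number.isSafeInteger(Number(n)), survivesRoundTrip(n));
}

// Inside a JSON string the digits are ordinary characters, so plain
// JSON.parse returns them unchanged and no numeric handling is needed.
const doc = JSON.parse(
  `{"message": "ranges [(${numerals[0]},${numerals[1]}]"}`
);
console.log(doc.message.includes(numerals[0])); // true
```

Digits inside a string value are already safe, so a correct long-numerals parser should only need to treat bare numeric values specially.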

Settings for this particular index and fields: index_settings

rlueckl avatar Mar 01 '24 08:03 rlueckl

I've created a smaller example which also throws the JSON parse error in Dashboards 2.12.0:

Steps to reproduce:

  • Add the following document to an index in your opensearch:
$ curl -v -H "Content-Type: application/json" -X POST "https://myopensearchhost01.server.lan:9200/logstash-2024.03.01/_doc" -d@minimal_example.json -u "user:pass"

minimal_example.json

  • Use "Discover" in OpenSearch Dashboards 2.12.0 and try to query a timerange which contains the document (Feb. 28th, 05:32).
  • You'll get the above-mentioned exception.
  • The same steps work fine with OpenSearch Dashboards 2.11.0.

rlueckl avatar Mar 01 '24 12:03 rlueckl

We are also facing the same problem after the 2.12.0 upgrade. Any leads or a fix would be appreciated. Surprisingly, it's only happening with some specific indexes.

atreyd avatar Mar 02 '24 11:03 atreyd

@AMoo-Miki is this fixed? Could you double check and resolve this issue?

ananzh avatar Mar 05 '24 18:03 ananzh

@ananzh - can you please share details about the release version that fixes this issue, or whether the fix will be included in a future release?

atreyd avatar Mar 07 '24 07:03 atreyd

This is occurring for me as well. Is there any update to this?

msoler8785 avatar Apr 03 '24 14:04 msoler8785

Looks like this may have been addressed in the 2.13 release here: https://github.com/opensearch-project/OpenSearch-Dashboards/issues/6134

msoler8785 avatar Apr 03 '24 14:04 msoler8785

Can anybody confirm if the bug has been fixed in 2.13.0? I don't have a test cluster unfortunately.

I've tried updating Dashboards only, but it seems that it's not backwards compatible with server version 2.12.0:

{"type":"log","@timestamp":"2024-04-10T06:03:55Z","tags":["error","savedobjects-service"],"pid":330933,"message":"This version of OpenSearch Dashboards (v2.13.0) is incompatible with the following OpenSearch nodes in your cluster: v2.12.0 @ hostname01.lan/10.x.x.x:9200 (10.x.x.x), v2.12.0 @ hostname02.lan/10.x.x.x:9200 (10.x.x.x)"}
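The log line above is consistent with a compatibility rule of "same major version, and no cluster node on an older minor version than Dashboards". The sketch below encodes that rule as an assumption drawn from this log message, not as the confirmed check in the Dashboards source; the names `parseVersion` and `isCompatible` are hypothetical.

```typescript
interface Version {
  major: number;
  minor: number;
  patch: number;
}

// Parse strings such as "2.12.0" or "v2.13.0".
function parseVersion(text: string): Version {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(text);
  if (!match) {
    throw new Error(`Unrecognized version: ${text}`);
  }
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
  };
}

// Assumed rule: majors must match and the OpenSearch node must not be
// on an older minor version than Dashboards.
function isCompatible(openSearch: string, dashboards: string): boolean {
  const node = parseVersion(openSearch);
  const osd = parseVersion(dashboards);
  return node.major === osd.major && node.minor >= osd.minor;
}

// The combination from the log line above: 2.12.0 nodes, 2.13.0 Dashboards.
console.log(isCompatible("v2.12.0", "v2.13.0")); // false
```

Under this assumed rule, the cluster nodes would need to be upgraded to 2.13.0 before or together with Dashboards.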

rlueckl avatar Apr 10 '24 11:04 rlueckl

We have taken the hotfix and built our own 2.12 snapshot version. Using your minimal data led to no error.

We ran the data on our 2.13 test cluster and saw no problem there either.

image

Seems it is safe to upgrade to 2.13 regarding the long numerals bug. For safety reasons we will wait out the community experience on 2.13.

cinhtau avatar Apr 12 '24 12:04 cinhtau

Hi @cinhtau,

please see my comment in #6134: the minimal example works now, but longer examples still lead to loops: https://github.com/opensearch-project/OpenSearch-Dashboards/issues/6134#issuecomment-2049586388

rlueckl avatar Apr 12 '24 13:04 rlueckl

#6377 seems to be related

cinhtau avatar Apr 12 '24 16:04 cinhtau