Endpoint /browse/dataset results in a 500 HTTP response once the number of datasets exceeds 20,000
**Issue**
On a clean setup of datahub:v0.8.12, as soon as the number of datasets exceeds 20,000, the /browse/dataset endpoint in the UI returns a 500 response code.
Datahub-GMS logs are as follows:
Sep 16, 2021 @ 13:02:34.000 07:32:34.561 [qtp544724190-12] ERROR c.l.m.s.e.query.ESBrowseDAO - Browse query failed: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]
Sep 16, 2021 @ 13:02:34.000 07:32:34.565 [Thread-528] ERROR c.l.d.g.r.browse.BrowseResolver - Failed to execute browse: entity type: DATASET, path: [prod], filters: null, start: 0, count: 10 com.linkedin.data.template.RequiredFieldNotPresentException: Field "value" is required but it is not present
ES logs are as follows:
[2021-09-21T14:48:32,508][DEBUG][o.e.a.s.TransportSearchAction] [node-2-mn] [datasetindex_v2][0], node[-dQxg0RDTBGIYZcEaT2pZA], [R], s[STARTED], a[id=Y1OUJhAcTTithvL8M_yDzw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[datasetindex_v2], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=5, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={"size":0,"query":{"bool":{"filter":[{"range":{"browsePaths.length":{"from":1,"to":null,"include_lower":false,"include_upper":true,"boost":1.0}}}],"must_not":[{"term":{"removed":{"value":"true","boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"groups":{"terms":{"field":"browsePaths","size":2147483647,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}],"include":"/.*","exclude":"/.*/.*"},"aggregations":{"allPaths":{"terms":{"field":"browsePaths","size":2147483647,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}}}}}}}]
org.elasticsearch.transport.RemoteTransportException: [node-5-dn][:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.aggregations.MultiBucketConsumerService$TooManyBucketsException: Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.
Related discussion in the Elasticsearch forums about this error ("TooManyBucketsException: Trying to create too many buckets"):
https://discuss.elastic.co/t/search-max-buckets-limit-error-on-7-0-1/179989/2
Setting [search.max_buckets] to Integer.MAX_VALUE at the cluster level is not recommended, as it may cause performance issues.
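If a bounded increase is acceptable in your environment, the setting can be changed via the cluster settings API. This is only a sketch: the value 65536 is an illustrative assumption rather than a tested recommendation, and the host/port assume a default local Elasticsearch.

```shell
# Sketch: raise search.max_buckets to a bounded value instead of Integer.MAX_VALUE.
# The value 65536 and the localhost:9200 endpoint are assumptions for illustration.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "persistent": {
      "search.max_buckets": 65536
    }
  }'
```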
**To Reproduce**
Steps to reproduce the behavior:
- git clone https://github.com/linkedin/datahub.git
- git checkout tags/v0.8.12 -b v0.8.12
- cd datahub/docker && docker-compose up
- datahub ingest -c file_with_30k_datasets.yml (a sketch of a minimal recipe is shown after these steps)
- Browse to http://localhost:9002/browse/dataset
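The contents of file_with_30k_datasets.yml are not included in this issue; the snippet below is a hypothetical minimal recipe (a file source reading a pre-generated MCE JSON file into the default local datahub-rest sink) just to make the ingest step concrete.

```shell
# Hypothetical recipe for the "datahub ingest" step above; the source file path
# and GMS address are assumptions, not taken from the issue.
cat > file_with_30k_datasets.yml <<'EOF'
source:
  type: file
  config:
    filename: ./datasets_30k.json   # assumed pre-generated MCE file with ~30k datasets
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080   # default local GMS endpoint
EOF
datahub ingest -c file_with_30k_datasets.yml
```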
**Expected behavior**
DataHub should display all datasets without errors.
**Desktop (please complete the following information):**
- OS: CentOS 7
- Browser: Chrome
- Version: v0.8.12
**Additional Context**
https://app.slack.com/client/TUMKD5EGJ/CV2UXSE9L/thread/CV2UXSE9L-1632251996.136600
Hi there! This has to do with the Elasticsearch settings, specifically the default of the search.max_buckets cluster setting.
In your case, you'd want to increase that setting in production. See more here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html
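For reference, one way to check the effective value the cluster is currently using is the cluster settings API with defaults included. A minimal sketch, assuming Elasticsearch is reachable on localhost:9200:

```shell
# Show the effective value of search.max_buckets, including the default;
# the localhost:9200 endpoint is an assumption.
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&filter_path=*.search.max_buckets"
```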
Is this still an issue? If not, will close it in a few days due to inactivity.
@anshbansal we and other folks are still facing the same issue, as discussed in this Slack thread
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io
I'm going to close this in favor of this issue, since they're duplicates, and they're being tracked in one place!
https://github.com/datahub-project/datahub/issues/4575