DataflowTemplates
[Bug]: ElasticsearchIO does not properly respect getMaxBatchSizeBytes option
Related Template(s)
BigQueryToElasticsearch, GCSToElasticsearch, PubsubToElasticsearch
What happened?
When setting the max batch size bytes option, that value is not respected and some batches can be larger. This is particularly evident when setting a batch size close to the maximum Elasticsearch will accept (with 7.17 this appears to be around 4080218931 bytes). In that case some bulk requests may exceed that size and cause the job to fail with Elasticsearch errors that the request is too large (larger than the limit of 4080218931).
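To make the failure mode concrete, here is a minimal standalone sketch (not the template's actual code; the class, field, and method names are made up for illustration) of the add-then-check pattern that seems to be in play: the byte count is only compared against the limit after the document has already been appended, so the batch that gets flushed can exceed the configured limit by up to one document's size.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustration only: shows how an add-then-check batcher overshoots its byte limit.
public class AddThenCheckBatcher {
  static final long MAX_BATCH_SIZE_BYTES = 100; // stand-in for the configured maxBatchSizeBytes

  final List<String> batch = new ArrayList<>();
  long currentBatchSizeBytes = 0;

  void add(String document) {
    batch.add(document);
    currentBatchSizeBytes += document.getBytes(StandardCharsets.UTF_8).length;
    // The check happens only after the append, so this flush can send a
    // request that is already larger than MAX_BATCH_SIZE_BYTES.
    if (currentBatchSizeBytes >= MAX_BATCH_SIZE_BYTES) {
      flush();
    }
  }

  void flush() {
    System.out.printf("flushing %d docs, %d bytes%n", batch.size(), currentBatchSizeBytes);
    batch.clear();
    currentBatchSizeBytes = 0;
  }

  public static void main(String[] args) {
    AddThenCheckBatcher batcher = new AddThenCheckBatcher();
    batcher.add("x".repeat(90)); // 90 bytes, under the limit, no flush
    batcher.add("y".repeat(60)); // flushes 150 bytes: 50% over the configured limit
  }
}
```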
Beam Version
Newer than 2.35.0
Relevant log output
Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.io.IOException: Error writing to Elasticsearch, some elements could not be inserted: Document id OrgUnit-5272161830961152-5862129544593408: [parent] Data too large, data for [indices:data/write/bulk[s]] would be [4113463890/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4109805368/3.8gb], new bytes reserved: [3658522/3.4mb], usages [request=0/0b, fielddata=150022532/143mb, in_flight_requests=28833374/27.4mb, model_inference=0/0b, eql_sequence=0/0b, accounting=17609740/16.7mb] (circuit_breaking_exception) Document id OrgUnit-6558191051603968-5629499534213120: [parent] Data too large, data for [indices:data/write/bulk[s]] would be [4113463890/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4109805368/3.8gb], new bytes reserved: [3658522/3.4mb], usages [request=0/0b, fielddata=150022532/143mb, in_flight_requests=28833374/27.4mb, model_inference=0/0b, eql_sequence=0/0b, accounting=17609740/16.7mb] (circuit_breaking_exception) Document id OrgUnit-5839022570209280-5772561977835520: [parent] Data too large, data for [indices:data/write/bulk[s]] would be [4130575290/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4126582584/3.8gb], new bytes reserved: [3992706/3.8mb], usages [request=0/0b, fielddata=150022532/143mb, in_flight_requests=29167558/27.8mb, model_inference=0/0b, eql_sequence=0/0b, accounting=17609740/16.7mb] (circuit_breaking_exception) Document id OrgUnit-5555091201589248-5677948539764736: [parent] Data too large, data for [indices:data/write/bulk[s]] would be [4113463890/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4109805368/3.8gb], new bytes reserved: [3658522/3.4mb], usages [request=0/0b, fielddata=150022532/143mb, in_flight_requests=28833374/27.4mb, model_inference=0/0b, eql_sequence=0/0b, accounting=17609740/16.7mb] (circuit_breaking_exception) Document id OrgUnit-6009931400216576-4512993362313216: [parent] Da..
I believe this could be fixed by doing a check and conditional flush before this line: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/elasticsearch-common/src/main/java/com/google/cloud/teleport/v2/elasticsearch/utils/ElasticsearchIO.java#L1459
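Roughly what I have in mind, written against the toy batcher in the sketch above rather than as a patch to ElasticsearchIO.java (the real code's field and method names may differ): flush the pending batch before appending whenever the new document would push it past the configured limit.

```java
  // Proposed check-and-conditional-flush before appending (sketch, same toy class as above).
  void add(String document) {
    long documentSizeBytes = document.getBytes(StandardCharsets.UTF_8).length;
    // Flush first if adding this document would push the batch over the limit,
    // so a flushed request never exceeds MAX_BATCH_SIZE_BYTES (except when a
    // single document is itself larger than the limit).
    if (!batch.isEmpty()
        && currentBatchSizeBytes + documentSizeBytes > MAX_BATCH_SIZE_BYTES) {
      flush();
    }
    batch.add(document);
    currentBatchSizeBytes += documentSizeBytes;
    if (currentBatchSizeBytes >= MAX_BATCH_SIZE_BYTES) {
      flush();
    }
  }
```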
It looks like the error in question may not have much to do with the bulk size in bytes and is instead related to the configured JVM heap size: https://discuss.elastic.co/t/org-elasticsearch-common-breaker-circuitbreakingexception-parent-data-too-large-data-for-indices-data-write-bulk-s-r/275660
However, even though it is not causing the error mentioned above, it is still a bug that the code does not respect the configured maxBatchSizeBytes.