Memory leak with TPC-DS benchmark, n2-highmem-32, running power test for 3 rounds
**Describe the bug**
Hit a memory leak (reported as a warning) when running the TPC-DS benchmark with the gazelle plugin, using the configurations/instructions in the readme file. The only change was that I ran the power test for 3 rounds instead of one:

```bash
bash bin/tpc_ds.sh run ./repo/confs/gazelle_plugin_performance 3
```
There were multiple other errors in the log; specifically, right after the memory leak warning the following error was spotted:
```
ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
```
The error happens during q67.
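As a side check (not part of the original run, just a hedged sketch), one could verify on a worker whether the library referenced by LD_PRELOAD actually exists at that path, since the loader says it cannot open it:

```bash
# Sketch only: confirm /usr/lib64/libjemalloc.so (the path from the error above)
# is present on a worker node. The hostname is taken from the log below and may
# differ in other setups.
ssh aop-gazelle-w-0 'ls -l /usr/lib64/libjemalloc.so || ldconfig -p | grep -i jemalloc'
```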
Snapshot of the errors:
`
22/01/19 22:45:24 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: TakeOrderedAndProject(limit=100, orderBy=[i_category#9773 ASC NULLS FIRST,i_class#9774 ASC NULLS FIRST,i_brand#9775 ASC NULLS FIRST,i_product_name#9776 ASC NULLS FIRST,d_year#9777 ASC NULLS FIRST,d_qoy#9778 ASC NULLS FIRST,d_moy#9779 ASC NULLS FIRST,s_store_id#9780 ASC NULLS FIRST,sumsales#9759 ASC NULLS FIRST,rk#9760 ASC NULLS FIRST], output=[i_category#9773,i_class#9774,i_brand#9775,i_product_name#9776,d_year#9777,d_qoy#9778,d_moy#9779,s_store_id#9780,sumsales#9759,rk#9760])
+- Project [i_category#9773, i_class#9774, i_brand#9775, i_product_name#9776, d_year#9777, d_qoy#9778, d_moy#9779, s_store_id#9780, sumsales#9759, rk<>global#9801 AS rk#9760]
+- Filter (isnotnull(rk<>global#9801) AND (rk<>global#9801 <= 100))
+- Window [rank(sumsales#9759) windowspecdefinition(i_category#9773, sumsales#9759 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk<>global#9801], [i_category#9773], [sumsales#9759 DESC NULLS LAST]
+- Project [i_category#9773, i_class#9774, i_brand#9775, i_product_name#9776, d_year#9777, d_qoy#9778, d_moy#9779, s_store_id#9780, sumsales#9759]
+- Filter (isnotnull(rk<>local#9800) AND (rk<>local#9800 <= 100))
+- LocalWindow [rank(sumsales#9759) windowspecdefinition(i_category#9773, sumsales#9759 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk<>local#9800], [i_category#9773], [sumsales#9759 DESC NULLS LAST]
+- HashAggregate(keys=[i_category#9773, i_class#9774, i_brand#9775, i_product_name#9776, d_year#9777, d_qoy#9778, d_moy#9779, s_store_id#9780, spark_grouping_id#9772L], functions=[sum(coalesce(CheckOverflow((promote_precision(cast(ss_sales_price#1399 as decimal(12,2))) * promote_precision(cast(cast(ss_quantity#1396 as decimal(10,0)) as decimal(12,2)))), DecimalType(18,2), true), 0.00))], output=[i_category#9773, i_class#9774, i_brand#9775, i_product_name#9776, d_year#9777, d_qoy#9778, d_moy#9779, s_store_id#9780, sumsales#9759])
+- HashAggregate(keys=[i_category#9773, i_class#9774, i_brand#9775, i_product_name#9776, d_year#9777, d_qoy#9778, d_moy#9779, s_store_id#9780, spark_grouping_id#9772L], functions=[partial_sum(coalesce(CheckOverflow((promote_precision(cast(ss_sales_price#1399 as decimal(12,2))) * promote_precision(cast(cast(ss_quantity#1396 as decimal(10,0)) as decimal(12,2)))), DecimalType(18,2), true), 0.00))], output=[i_category#9773, i_class#9774, i_brand#9775, i_product_name#9776, d_year#9777, d_qoy#9778, d_moy#9779, s_store_id#9780, spark_grouping_id#9772L, sum#9805, isEmpty#9806])
+- Expand [List(ss_quantity#1396, ss_sales_price#1399, i_category#376, i_class#374, i_brand#372, i_product_name#385, d_year#290, d_qoy#294, d_moy#292, s_store_id#465, 0), List(ss_quantity#1396, ss_sales_price#1399, i_category#376, i_class#374, i_brand#372, i_product_name#385, d_year#290, d_qoy#294, d_moy#292, null, 1), List(ss_quantity#1396, ss_sales_price#1399, i_category#376, i_class#374, i_brand#372, i_product_name#385, d_year#290, d_qoy#294, null, null, 3), List(ss_quantity#1396, ss_sales_price#1399, i_category#376, i_class#374, i_brand#372, i_product_name#385, d_year#290, null, null, null, 7), List(ss_quantity#1396, ss_sales_price#1399, i_category#376, i_class#374, i_brand#372, i_product_name#385, null, null, null, null, 15), List(ss_quantity#1396, ss_sales_price#1399, i_category#376, i_class#374, i_brand#372, null, null, null, null, null, 31), List(ss_quantity#1396, ss_sales_price#1399, i_category#376, i_class#374, null, null, null, null, null, null, 63), List(ss_quantity#1396, ss_sales_price#1399, i_category#376, null, null, null, null, null, null, null, 127), List(ss_quantity#1396, ss_sales_price#1399, null, null, null, null, null, null, null, null, 255)], [ss_quantity#1396, ss_sales_price#1399, i_category#9773, i_class#9774, i_brand#9775, i_product_name#9776, d_year#9777, d_qoy#9778, d_moy#9779, s_store_id#9780, spark_grouping_id#9772L]
+- Project [ss_quantity#1396, ss_sales_price#1399, i_category#376, i_class#374, i_brand#372, i_product_name#385, d_year#290, d_qoy#294, d_moy#292, s_store_id#465]
+- BroadcastHashJoin [ss_item_sk#1388], [i_item_sk#364], Inner, BuildRight, false
:- Project [ss_item_sk#1388, ss_quantity#1396, ss_sales_price#1399, d_year#290, d_moy#292, d_qoy#294, s_store_id#465]
: +- BroadcastHashJoin [ss_store_sk#1393], [s_store_sk#464], Inner, BuildRight, false
: :- Project [ss_item_sk#1388, ss_store_sk#1393, ss_quantity#1396, ss_sales_price#1399, d_year#290, d_moy#292, d_qoy#294]
: : +- BroadcastHashJoin [ss_sold_date_sk#1409], [d_date_sk#284], Inner, BuildRight, false
: : :- Project [ss_item_sk#1388, ss_store_sk#1393, ss_quantity#1396, ss_sales_price#1399, ss_sold_date_sk#1409]
: : : +- Filter (isnotnull(ss_store_sk#1393) AND isnotnull(ss_item_sk#1388))
: : : +- FileScan arrow tpcds_arrow_partition_scale_1000_db.store_sales[ss_item_sk#1388,ss_store_sk#1393,ss_quantity#1396,ss_sales_price#1399,ss_sold_date_sk#1409] Batched: true, DataFilters: [isnotnull(ss_store_sk#1393), isnotnull(ss_item_sk#1388)], Format: com.intel.oap.spark.sql.execution.datasources.arrow.ArrowFileFormat@70563a62, Location: InMemoryFileIndex[hdfs://aop-gazelle-m/datagen/tpcds_parquet_partition/1000/store_sales/ss_sold_d..., PartitionFilters: [isnotnull(ss_sold_date_sk#1409), dynamicpruning#9802 [ss_sold_date_sk#1409]], PushedFilters: [IsNotNull(ss_store_sk), IsNotNull(ss_item_sk)], ReadSchema: struct<ss_item_sk:int,ss_store_sk:int,ss_quantity:int,ss_sales_price:decimal(7,2)>
: : : +- Project [d_date_sk#284, d_year#290, d_moy#292, d_qoy#294]
: : : +- Filter (((isnotnull(d_month_seq#287) AND (d_month_seq#287 >= 1200)) AND (d_month_seq#287 <= 1211)) AND isnotnull(d_date_sk#284))
: : : +- Relation[d_date_sk#284,d_date_id#285,d_date#286,d_month_seq#287,d_week_seq#288,d_quarter_seq#289,d_year#290,d_dow#291,d_moy#292,d_dom#293,d_qoy#294,d_fy_year#295,d_fy_quarter_seq#296,d_fy_week_seq#297,d_day_name#298,d_quarter_name#299,d_holiday#300,d_weekend#301,d_following_holiday#302,d_first_dom#303,d_last_dom#304,d_same_day_ly#305,d_same_day_lq#306,d_current_day#307,... 4 more fields] arrow
: : +- Project [d_date_sk#284, d_year#290, d_moy#292, d_qoy#294]
: : +- Filter (((isnotnull(d_month_seq#287) AND (d_month_seq#287 >= 1200)) AND (d_month_seq#287 <= 1211)) AND isnotnull(d_date_sk#284))
: : +- FileScan arrow tpcds_arrow_partition_scale_1000_db.date_dim[d_date_sk#284,d_month_seq#287,d_year#290,d_moy#292,d_qoy#294] Batched: true, DataFilters: [isnotnull(d_month_seq#287), (d_month_seq#287 >= 1200), (d_month_seq#287 <= 1211), isnotnull(d_da..., Format: com.intel.oap.spark.sql.execution.datasources.arrow.ArrowFileFormat@701f8b18, Location: InMemoryFileIndex[hdfs://aop-gazelle-m/datagen/tpcds_parquet_partition/1000/date_dim], PartitionFilters: [], PushedFilters: [IsNotNull(d_month_seq), GreaterThanOrEqual(d_month_seq,1200), LessThanOrEqual(d_month_seq,1211),..., ReadSchema: struct<d_date_sk:int,d_month_seq:int,d_year:int,d_moy:int,d_qoy:int>
: +- Project [s_store_sk#464, s_store_id#465]
: +- Filter isnotnull(s_store_sk#464)
: +- FileScan arrow tpcds_arrow_partition_scale_1000_db.store[s_store_sk#464,s_store_id#465] Batched: true, DataFilters: [isnotnull(s_store_sk#464)], Format: com.intel.oap.spark.sql.execution.datasources.arrow.ArrowFileFormat@6ec85c7c, Location: InMemoryFileIndex[hdfs://aop-gazelle-m/datagen/tpcds_parquet_partition/1000/store], PartitionFilters: [], PushedFilters: [IsNotNull(s_store_sk)], ReadSchema: struct<s_store_sk:int,s_store_id:string>
+- Project [i_item_sk#364, i_brand#372, i_class#374, i_category#376, i_product_name#385]
+- Filter isnotnull(i_item_sk#364)
+- FileScan arrow tpcds_arrow_partition_scale_1000_db.item[i_item_sk#364,i_brand#372,i_class#374,i_category#376,i_product_name#385] Batched: true, DataFilters: [isnotnull(i_item_sk#364)], Format: com.intel.oap.spark.sql.execution.datasources.arrow.ArrowFileFormat@19c69075, Location: InMemoryFileIndex[hdfs://aop-gazelle-m/datagen/tpcds_parquet_partition/1000/item], PartitionFilters: [], PushedFilters: [IsNotNull(i_item_sk)], ReadSchema: struct<i_item_sk:int,i_brand:string,i_class:string,i_category:string,i_product_name:string>
.
[Stage 1114:================================================> (83 + 9) / 92] 22/01/19 22:48:39 WARN org.apache.spark.deploy.yarn.YarnAllocator: Container from a bad node: container_1642622251224_0002_01_000007 on host: aop-gazelle-w-0.c.articulate-rain-321323.internal. Exit status: 134. Diagnostics: ion.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference isEmpty#9806 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference i_category#9773 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference i_class#9774 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference i_brand#9775 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference i_product_name#9776 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference d_year#9777 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference d_qoy#9778 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference d_moy#9779 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference s_store_id#9780 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference spark_grouping_id#9772L is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference sum#9805 is supported, no_cal is false. 22/01/19 22:47:31 INFO com.intel.oap.expression.ColumnarExpressionConverter: class org.apache.spark.sql.catalyst.expressions.AttributeReference isEmpty#9806 is supported, no_cal is false. 22/01/19 22:47:48 WARN org.apache.spark.memory.ExecutionMemoryPool: Internal error: release called on 8388608 bytes but task only has 0 bytes of memory from the off-heap execution pool 22/01/19 22:47:52 WARN org.apache.spark.sql.execution.datasources.v2.arrow.SparkMemoryUtils: Detected leaked allocator, size: 8192... 22/01/19 22:47:52 WARN org.apache.spark.executor.Executor: Managed memory leak detected; size = 8388608 bytes, task 0.0 in stage 1114.0 (TID 63969)
[2022-01-19 22:48:39.413]Container exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. ERROR: ld.so: object '/usr/lib64/libjemalloc.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. /bin/bash: line 1: 28554 Aborted (core dumped) LD_LIBRARY_PATH="/opt/benchmark-tools/oap/lib:/opt/benchmark-tools/oap/lib" /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx8192m '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=5' '-XX:NewRatio=1' '-XX:SurvivorRatio=1' '-XX:+UseCompressedOops' '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCTimeStamps' -Djava.io.tmpdir=/mnt/2/hadoop/yarn/nm-local-dir/usercache/eman_copty_intel_com/appcache/application_1642622251224_0002/container_1642622251224_0002_01_000007/tmp '-Dspark.driver.port=46363' '-Dspark.network.timeout=3600s' '-Dspark.authenticate=false' '-Dspark.ui.port=0' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1642622251224_0002/container_1642622251224_0002_01_000007 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@aop-gazelle-m.c.articulate-rain-321323.internal:46363 --executor-id 7 --hostname aop-gazelle-w-0.c.articulate-rain-321323.internal --cores 8 --app-id application_1642622251224_0002 --resourceProfileId 0 --user-class-path file:/mnt/2/hadoop/yarn/nm-local-dir/usercache/eman_copty_intel_com/appcache/application_1642622251224_0002/container_1642622251224_0002_01_000007/app.jar --user-class-path file:/mnt/2/hadoop/yarn/nm-local-dir/usercache/eman_copty_intel_com/appcache/application_1642622251224_0002/container_1642622251224_0002_01_000007/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar > /var/log/hadoop-yarn/userlogs/application_1642622251224_0002/container_1642622251224_0002_01_000007/stdout 2> /var/log/hadoop-yarn/userlogs/application_1642622251224_0002/container_1642622251224_0002_01_000007/stderr Last 4096 bytes of stderr : s supported, no_cal is false.`
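The "Managed memory leak detected" warning in the snapshot above is logged by Spark's Executor when a task finishes without freeing all of its managed memory. For debugging, one option (my own suggestion, not something from the original setup) is to make such a leak fail the task so the leaking operator in q67 surfaces immediately, assuming the standard `spark.unsafe.exceptionOnMemoryLeak` flag is honored by this build and that Spark conf lives in the usual Dataproc location:

```bash
# Sketch only (assumes the Dataproc default conf path): turn the executor's
# "Managed memory leak detected" warning into a task failure instead of a log
# line, so the leaking task in stage 1114 fails loudly and can be pinned down.
echo "spark.unsafe.exceptionOnMemoryLeak true" | sudo tee -a /etc/spark/conf/spark-defaults.conf
```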
**To Reproduce**
With oap 1.3.0.dataproc20: create a cluster in Dataproc using the instructions in this [link](https://github.com/oap-project/oap-tools/blob/master/integrations/oap/dataproc/benchmark/Gazelle_on_Dataproc.md), with 4 local SSDs per worker, ubuntu-18, n2-highmem-32.
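For reference, a condensed sketch of the cluster shape follows; the authoritative steps, including the OAP initialization actions and properties, are in the linked guide, and the cluster name, region, and image version below are illustrative assumptions:

```bash
# Illustrative only; follow the linked Gazelle_on_Dataproc.md for the real
# bootstrap/initialization steps. Cluster name and region are placeholders.
gcloud dataproc clusters create aop-gazelle \
  --region=us-central1 \
  --image-version=2.0-ubuntu18 \
  --worker-machine-type=n2-highmem-32 \
  --num-worker-local-ssds=4

# Then run the power test for 3 rounds, as in the report:
bash bin/tpc_ds.sh run ./repo/confs/gazelle_plugin_performance 3
```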
**Expected behavior**
The benchmark runs free of memory leaks.