hudi
[SUPPORT] SqlQueryBasedTransformer causes memory issues
Describe the problem you faced
A DeltaStreamer job that previously ran fine runs into memory issues after adding a SqlQueryBasedTransformer that only SELECTs one column.
"--transformer-class", "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer", "--hoodie-conf", "hoodie.deltastreamer.transformer.sql=SELECT a.ATTRIBUTES FROM <SRC> a"
To Reproduce
Steps to reproduce the behavior:
- Add SqlQueryBasedTransformer with simple SELECT statement to a DeltaStreamer job
- Run job
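To make the reproduction concrete, here is a minimal sketch of how the two extra options could be appended to an existing DeltaStreamer argument list. Only `--transformer-class`, `--hoodie-conf`, and their values come from this report; the helper name `transformer_args` is hypothetical, the check for `<SRC>` assumes it is the placeholder the transformer substitutes with the incoming batch, and all other job arguments are omitted.

```python
from typing import List

TRANSFORMER_CLASS = "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer"
SQL_CONF_KEY = "hoodie.deltastreamer.transformer.sql"


def transformer_args(sql: str) -> List[str]:
    """Build the extra arguments that enable the SQL transformer (hypothetical helper)."""
    # Assumption: the query must reference the <SRC> placeholder for the source batch.
    if "<SRC>" not in sql:
        raise ValueError("query must reference the <SRC> placeholder")
    return [
        "--transformer-class", TRANSFORMER_CLASS,
        "--hoodie-conf", f"{SQL_CONF_KEY}={sql}",
    ]


# The exact query used in this report.
extra_args = transformer_args("SELECT a.ATTRIBUTES FROM <SRC> a")
print(extra_args)
```

Removing these four arguments from the job should restore the previous, working configuration, which makes it easy to compare the two runs.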
Expected behavior
The job returns the single selected column without memory issues.
Environment Description
- Hudi version : 0.10.1
- Spark version : 3.1.2
- Hive version : -
- Hadoop version : 3.1.2
- Storage (HDFS/S3/GCS..) : Reading from Kafka, storing in S3
- Running on Docker? (yes/no) : no
Additional context
Additional screenshots and messages are in this Slack thread: https://apache-hudi.slack.com/archives/C4D716NPQ/p1663698444989499
Stacktrace
```
2022-09-19T21:45:44.236+0000: [GC (Allocation Failure) [PSYoungGen: 25113K->25029K(2758656K)] 77023K->76946K(8351232K), 0.0177561 secs] [Times: user=0.02 sys=0.02, real=0.02 secs]
2022-09-19T21:45:44.254+0000: [Full GC (Allocation Failure) [PSYoungGen: 25029K->0K(2758656K)] [ParOldGen: 51917K->54295K(5592576K)] 76946K->54295K(8351232K), [Metaspace: 112463K->112463K(1155072K)], 0.
2022-09-19T21:45:44.378+0000: [GC (Allocation Failure) [PSYoungGen: 0K->0K(2720768K)] 54295K->54295K(8313344K), 0.0035697 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
2022-09-19T21:45:44.381+0000: [Full GC (Allocation Failure) [PSYoungGen: 0K->0K(2720768K)] [ParOldGen: 54295K->45261K(5592576K)] 54295K->45261K(8313344K), [Metaspace: 112463K->109953K(1155072K)], 0.1912
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 22"..
```
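As an aid for reading the GC lines above, the following sketch extracts the whole-heap figures (occupancy before and after collection, and total capacity) from one log line and reports how full the heap was. It assumes the `before->after(capacity)` layout shown in the log, where the whole-heap entry is the one directly followed by a comma; the function name `heap_usage` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Whole-heap entry: "<before>K-><after>K(<capacity>K)," (the per-generation
# entries are followed by "]" instead of ",", so they do not match).
HEAP_TOTAL = re.compile(r"(\d+)K->(\d+)K\((\d+)K\),")


def heap_usage(line: str) -> Optional[Tuple[int, int, int]]:
    """Return (before, after, capacity) in KiB for one GC log line, or None."""
    match = HEAP_TOTAL.search(line)
    if match is None:
        return None
    before, after, capacity = (int(group) for group in match.groups())
    return before, after, capacity


# First GC line from the log in this report.
line = (
    "2022-09-19T21:45:44.236+0000: [GC (Allocation Failure) "
    "[PSYoungGen: 25113K->25029K(2758656K)] 77023K->76946K(8351232K), "
    "0.0177561 secs]"
)
before, after, capacity = heap_usage(line)
print(f"heap after GC: {after}K of {capacity}K ({after / capacity:.2%})")
```

Running this over each GC line gives a quick view of heap occupancy in the moments before the `OutOfMemoryError`.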