
[BUG] The delta log is too large, and OOM always occurs when the delta log checkpoint is executed.

Open · kongluo opened this issue 3 years ago · 1 comment

Bug

At present, the latest metrics for the delta log are as follows: checkpointSize: 7639547, numOfFiles: 5140110.


At present, each executor is given 18G of memory, with spark.shuffle.memoryFraction=0.3. However, while operating on the Delta Lake table, the Structured Streaming job still spends a long time performing the delta log checkpoint, and OOM occurs frequently. By my calculation, the memory provided for shuffle should be completely sufficient for the parquet data, so why does this happen? Is it due to something internal to Delta Lake?
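For reference, a minimal sketch of the setup described above, assuming the settings are applied through the SparkSession builder (in a real deployment they would more likely be passed via spark-submit); the application name is a placeholder, and the values are those quoted in the report:

```scala
import org.apache.spark.sql.SparkSession

// Executor memory and shuffle fraction are the values quoted in the report;
// the application name is hypothetical.
val spark = SparkSession.builder()
  .appName("delta-structured-streaming-job")
  .config("spark.executor.memory", "18g")
  .config("spark.shuffle.memoryFraction", "0.3")
  .getOrCreate()
```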


Environment information

  • Delta Lake version: 1.0.1
  • Spark version: 3.1.1
  • Scala version: 2.12.8

kongluo · Jun 22 '22 08:06

Sorry for the massive delay, but are you still having this problem? Are you sure it's failing during snapshot construction? Can you read the table in a batch query? If you can, then I suggest you run OPTIMIZE (introduced in Delta 1.2 on Spark 3.2) to compact the files and reduce the number of files in the Delta table. That should reduce the overhead of creating the snapshot.
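A minimal sketch of that compaction step, assuming Delta Lake 1.2+ on Spark 3.2+ and a hypothetical table path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-delta-table").getOrCreate()

// OPTIMIZE rewrites many small files into fewer, larger ones, which lowers
// numOfFiles and therefore the work needed to build the snapshot/checkpoint.
spark.sql("OPTIMIZE delta.`/path/to/delta/table`")
```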

I am honestly surprised, though, that this is happening. Spark is fairly robust at handling shuffle data larger than memory by spilling to disk, so I don't know what is causing your job to fail. If you can query the Delta table in a batch query and see the same issue, then we can debug from there. I would look at the details of the Spark job that is calculating the snapshot: how many tasks, how much shuffle data, executor memory usage, etc.
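A sketch of the batch-read check suggested above (the table path is hypothetical). If this simple query also OOMs, the problem is in snapshot construction itself rather than in the streaming job:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-batch-check").getOrCreate()

// Reading the table as a batch query forces the snapshot to be constructed,
// so the Spark UI for this job shows the tasks and shuffle sizes involved.
val df = spark.read.format("delta").load("/path/to/delta/table")
println(df.count())
```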

tdas · Aug 16 '22 17:08