
When we use the Spark action rewriteDataFiles, how can we limit the memory used by equality-delete file compaction?

Open · fengsen-neu opened this issue 3 years ago · 5 comments

When we use the Spark action rewriteDataFiles on a v2 table, equality-delete files are also merged into the data files. Each equality-delete file is applied against every data file whose sequence number is lower than its own. This makes the Spark executors GC frequently and causes 'connection reset by peer' and 'heartbeat timeout' errors during rewriteDataFiles. How can we limit the JVM memory used when rewriteDataFiles compacts equality-delete files?

fengsen-neu avatar Jan 17 '22 09:01 fengsen-neu
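For context, newer versions of the rewrite action expose options that bound how much data each rewrite group pulls in. They do not cap the delete-set size directly, but smaller groups generally mean fewer data files (and thus fewer applicable delete rows) in flight per task. A minimal sketch, assuming `spark` and `table` are already in scope and an Iceberg version where these options exist:

```java
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;

// Hedged sketch: bound the size of each rewrite group so a single
// executor task compacts less data at a time. Check the option names
// against the RewriteDataFiles docs for your Iceberg version.
RewriteDataFiles.Result result =
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option("max-file-group-size-bytes", String.valueOf(1024L * 1024 * 1024)) // 1 GiB groups
        .option("partial-progress.enabled", "true") // commit finished groups independently
        .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
        .execute();
```

Note this only shrinks the unit of work; the comments below discuss reducing the delete set itself.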

A large amount of memory is used because, for each data file, all rows of every delete file with a higher sequence number than the data file are read into a HashSet for filtering. But only the keys that also appear in the data file actually need to be kept. In the optimized version at my company, I use a Bloom filter built from the data file's keys to filter out unnecessary eq-delete keys (a HashSet of the data file's keys also works, but usually consumes more memory). If the data file format can store a Bloom filter, such as Parquet based on #2642, it is even easier to read the Bloom filter directly from the data file. Maybe I can open a pull request if needed.

moon-fall avatar Jan 20 '22 02:01 moon-fall
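In outline, the optimization described above could look like the sketch below: build a Bloom filter over the data file's equality keys first, then admit an eq-delete key into the exact in-memory set only if it might appear in the data file. This is an illustrative reconstruction using Guava's `BloomFilter`, not moon-fall's actual patch; keys are modeled as strings for simplicity:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Hedged sketch of the idea above, not the actual patch: a real
// implementation would build keys from the projected equality fields
// of each row rather than plain strings.
class EqDeleteKeyFilter {
  private final BloomFilter<CharSequence> dataFileKeys;

  EqDeleteKeyFilter(Iterable<String> keysInDataFile, long expectedKeys) {
    // One pass over the data file's keys builds a compact filter.
    this.dataFileKeys = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), expectedKeys, 0.01);
    keysInDataFile.forEach(dataFileKeys::put);
  }

  // Only eq-delete keys that *might* exist in the data file are loaded
  // into the exact in-memory set; the rest are dropped, which is what
  // bounds the HashSet's growth.
  Set<String> loadRelevantDeleteKeys(Iterable<String> eqDeleteKeys) {
    Set<String> relevant = new HashSet<>();
    for (String key : eqDeleteKeys) {
      if (dataFileKeys.mightContain(key)) {
        relevant.add(key);
      }
    }
    return relevant;
  }
}
```

Because Bloom filters produce only false positives, a few irrelevant keys may still be loaded, but no genuinely matching delete key is ever dropped, so correctness is preserved.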


[two screenshots attached] The DeleteFilter object that holds the delete-file rows uses far more memory than anything else during compaction. As you said, each data file reads all rows of every delete file with a higher sequence number into a HashSet for filtering, so I think we should bound the DeleteFilter's size to limit memory use.

fengsen-neu avatar Jan 21 '22 06:01 fengsen-neu
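For readers following along, the behavior both commenters describe boils down to a per-data-file loop like the hypothetical sketch below (the `FileHandle` interface and method names are illustrative, not Iceberg's internals):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DeleteSetSketch {
  // Hypothetical handle standing in for Iceberg's internal file readers.
  interface FileHandle {
    long sequenceNumber();
    List<String> keys(); // projected equality-field key of each row
  }

  // Illustrative: for each data file being rewritten, every eq-delete row
  // with a higher sequence number is materialized into a heap HashSet
  // before the scan, and the set stays resident for the whole scan.
  static Set<String> buildDeleteSet(FileHandle dataFile, List<FileHandle> eqDeleteFiles) {
    Set<String> deletedKeys = new HashSet<>();
    for (FileHandle delete : eqDeleteFiles) {
      if (delete.sequenceNumber() > dataFile.sequenceNumber()) {
        deletedKeys.addAll(delete.keys()); // all rows, relevant to this data file or not
      }
    }
    return deletedKeys;
  }
}
```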


> It's easier to read the Bloom filter directly from a data file. Maybe I can open a pull request if needed.

Could you please provide the PR for my reference?

fengsen-neu avatar Jan 25 '22 03:01 fengsen-neu

This is also a headache for us. We use a RocksDB-backed set (see #2680), but RocksDB is hard to tune. @moon-fall, could you please share your PR? It matters to us.

coolderli avatar Jan 26 '22 01:01 coolderli
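For reference, a RocksDB-backed set along the lines of #2680 might look like the sketch below: delete keys are spilled to local disk instead of the heap, trading lookup latency for bounded memory. This is a hedged reconstruction, not the code from that PR; the scratch path and string keys are illustrative:

```java
import java.nio.charset.StandardCharsets;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Hedged sketch of a disk-backed delete-key set in the spirit of #2680:
// keys live in RocksDB on executor-local disk, so heap usage stays
// roughly flat regardless of how large the eq-delete files are.
class RocksDBSet implements AutoCloseable {
  private static final byte[] PRESENT = new byte[0];
  private final Options options;
  private final RocksDB db;

  RocksDBSet(String localScratchPath) throws RocksDBException {
    RocksDB.loadLibrary();
    this.options = new Options().setCreateIfMissing(true);
    this.db = RocksDB.open(options, localScratchPath);
  }

  void add(String key) throws RocksDBException {
    db.put(key.getBytes(StandardCharsets.UTF_8), PRESENT);
  }

  boolean contains(String key) throws RocksDBException {
    return db.get(key.getBytes(StandardCharsets.UTF_8)) != null;
  }

  @Override
  public void close() {
    db.close();
    options.close();
  }
}
```

The tuning pain mentioned above is real: block-cache size, memtable size, and compaction settings all shift the memory/latency trade-off, which is why a key-pruning approach like the Bloom filter sketched earlier is attractive.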

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Aug 05 '22 00:08 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

github-actions[bot] avatar Aug 20 '22 00:08 github-actions[bot]