
Optimize generation of CombinedScanTask for RewriteDataFilesAction

Open zhangjun0x01 opened this issue 5 years ago • 4 comments

In RewriteDataFilesAction, the default value of targetSizeInBytes is 128M. Suppose there are the following data files: 20M, 20M, 20M, 70M, 100M. The current logic scans these data files in order and keeps adding each file to the current task as long as the sum of the file sizes stays <= targetSizeInBytes. So three CombinedScanTask tasks will be generated: (20M, 20M, 20M), (70M), (100M).
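To make the behaviour described above concrete, here is a minimal Python sketch of in-order packing. It only models the logic as described in this issue, not Iceberg's actual code; `pack_sequentially` and its parameters are hypothetical names, and sizes are plain integers in MB.

```python
def pack_sequentially(sizes_mb, target_mb=128):
    """Group files in scan order, closing a group when the next file would exceed the target."""
    tasks = []        # finished groups (each stands in for one CombinedScanTask)
    current = []      # group being filled
    current_sum = 0   # total size of the group being filled

    for size in sizes_mb:
        # Close the current group if adding this file would push it over the target.
        if current and current_sum + size > target_mb:
            tasks.append(current)
            current, current_sum = [], 0
        current.append(size)
        current_sum += size

    if current:
        tasks.append(current)
    return tasks


print(pack_sequentially([20, 20, 20, 70, 100]))
# [[20, 20, 20], [70], [100]]
```

The 70M file cannot join the first group because 60M + 70M exceeds 128M, and once that group is closed the leftover space is never revisited.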

Obviously, it is more appropriate to generate two CombinedScanTask tasks (20M, 20M, 70M), (20M, 100M).

We should optimize this algorithm so that it generates as few target data files as possible and makes their sizes as close to targetSizeInBytes as possible.
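One possible direction, sketched here only as an illustration, is the classic first-fit-decreasing bin packing heuristic: sort the files by size, largest first, and put each one into the first group that still has room. The function name `pack_first_fit_decreasing` is hypothetical, sizes are integers in MB, and this is a heuristic, so it is not guaranteed to find the minimum number of groups for every input.

```python
def pack_first_fit_decreasing(sizes_mb, target_mb=128):
    """Place each file, largest first, into the first group with enough free space."""
    bins = []  # each entry is [total_size, [file sizes]]

    for size in sorted(sizes_mb, reverse=True):
        for group in bins:
            if group[0] + size <= target_mb:
                group[0] += size
                group[1].append(size)
                break
        else:
            # No existing group has room (or the file alone exceeds the target).
            bins.append([size, [size]])

    return [files for _, files in bins]


print(pack_first_fit_decreasing([20, 20, 20, 70, 100]))
# [[100, 20], [70, 20, 20]]
```

For the example in this issue it yields two groups of 120M and 110M, the same layout as the one proposed above. Sorting does discard the original scan order, which may or may not matter for the rewrite.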

zhangjun0x01 avatar Oct 27 '20 06:10 zhangjun0x01

What alternative algorithm would you suggest? The current algorithm is simple and I'm sure could be improved.

rdblue avatar Oct 28 '20 23:10 rdblue

My idea is to use a dynamic programming algorithm to get an optimal result, but I haven't implemented it yet. I will think about how to do it and run a test later.
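Since no implementation was posted, the following is only a guess at what a dynamic programming approach could look like: a subset-sum DP picks the set of remaining files whose total is as close to the target as possible without exceeding it, and this is repeated until no files are left. All names (`best_subset`, `pack_by_subset_sum`) are hypothetical, sizes are assumed to be positive integers in MB, and filling one group at a time this way is still greedy across groups, so it does not guarantee a globally optimal packing.

```python
def best_subset(sizes_mb, target_mb):
    """Return indices of a subset whose sum is the largest value <= target_mb."""
    reachable = {0: []}  # reachable sum -> indices that produce it
    for i, size in enumerate(sizes_mb):
        # Iterate over a snapshot so each file is used at most once.
        for total, idxs in list(reachable.items()):
            new_total = total + size
            if new_total <= target_mb and new_total not in reachable:
                reachable[new_total] = idxs + [i]
    return reachable[max(reachable)]


def pack_by_subset_sum(sizes_mb, target_mb=128):
    """Repeatedly carve out the fullest possible group from the remaining files."""
    remaining = list(sizes_mb)
    tasks = []
    while remaining:
        chosen = best_subset(remaining, target_mb)
        if not chosen:
            # Every remaining file is larger than the target: give one its own group.
            chosen = [0]
        tasks.append([remaining[i] for i in chosen])
        chosen_set = set(chosen)
        remaining = [s for i, s in enumerate(remaining) if i not in chosen_set]
    return tasks


print(pack_by_subset_sum([20, 20, 20, 70, 100]))
# [[20, 100], [20, 20, 70]]
```

The DP table has at most target_mb + 1 entries, so real byte sizes would need to be bucketed (for example to whole megabytes) to keep it small.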

zhangjun0x01 avatar Oct 29 '20 01:10 zhangjun0x01

We should really get in my Medium Files PR first which would fix the suggested layout ^ (if the blocks were the right sizes) https://github.com/apache/iceberg/pull/3292

On Wed, Nov 3, 2021 at 4:11 AM kingeasternsun @.***> wrote:

It seems like a Knapsack problem


RussellSpitzer avatar Nov 03 '21 14:11 RussellSpitzer

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Feb 28 '24 00:02 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Mar 13 '24 00:03 github-actions[bot]