Optimize generation of CombinedScanTask for RewriteDataFilesAction
In RewriteDataFilesAction, the default value of targetSizeInBytes is 128M. Suppose there are data files of 20M, 20M, 20M, 70M, and 100M. The current logic scans these data files in order, accumulating them into a task as long as the running total stays <= targetSizeInBytes, so three CombinedScanTask tasks are generated: (20M, 20M, 20M), (70M), (100M).
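To make that behavior concrete, here is a minimal sketch of the sequential packing described above (not the actual Iceberg code; plain long sizes stand in for FileScanTask):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the actual Iceberg code): walk the files in scan
// order and start a new task whenever the next file would push the running
// total past the target.
public class SequentialPacking {
  static List<List<Long>> pack(List<Long> fileSizes, long target) {
    List<List<Long>> tasks = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentSize + size > target) {
        tasks.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      tasks.add(current);
    }
    return tasks;
  }

  public static void main(String[] args) {
    // Sizes in MB for readability; prints [[20, 20, 20], [70], [100]]
    System.out.println(pack(List.of(20L, 20L, 20L, 70L, 100L), 128L));
  }
}
```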
Obviously, it would be more appropriate to generate two CombinedScanTask tasks: (20M, 20M, 70M) and (20M, 100M).
We should optimize this algorithm to generate as few target data files as possible, with each file's size as close to targetSizeInBytes as possible.
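One well-known direction is first-fit-decreasing bin packing. A hedged sketch, again using bare long sizes rather than real FileScanTask objects, and not claiming this is how RewriteDataFilesAction should be wired up:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// First-fit-decreasing sketch: sort files largest-first, then place each file
// into the first task that still has room; open a new task only if none fits.
public class FirstFitDecreasing {
  static List<List<Long>> pack(List<Long> fileSizes, long target) {
    List<Long> sorted = new ArrayList<>(fileSizes);
    sorted.sort(Comparator.reverseOrder());
    List<List<Long>> tasks = new ArrayList<>();
    List<Long> totals = new ArrayList<>();
    for (long size : sorted) {
      boolean placed = false;
      for (int i = 0; i < tasks.size(); i++) {
        if (totals.get(i) + size <= target) {
          tasks.get(i).add(size);
          totals.set(i, totals.get(i) + size);
          placed = true;
          break;
        }
      }
      if (!placed) {
        List<Long> task = new ArrayList<>();
        task.add(size);
        tasks.add(task);
        totals.add(size);
      }
    }
    return tasks;
  }

  public static void main(String[] args) {
    // Prints [[100, 20], [70, 20, 20]] -> the two-task layout suggested above
    System.out.println(pack(List.of(20L, 20L, 20L, 70L, 100L), 128L));
  }
}
```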
What alternative algorithm would you suggest? The current algorithm is simple and I'm sure it could be improved.
My idea is to use a dynamic programming algorithm to get an optimal result, but I haven't implemented it yet. I will think about how to do it and run some tests later.
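For illustration, a sketch of what that could look like: a 0/1-knapsack (subset-sum) DP that repeatedly pulls out the group of remaining files whose total size is closest to the target without exceeding it. This is my own hypothetical sketch, not the author's planned implementation; sizes are assumed to be quantized (e.g. to MB) so the DP table stays small, and repeating a per-task knapsack is still a heuristic, since exact bin packing is NP-hard.

```java
import java.util.ArrayList;
import java.util.List;

public class KnapsackPacking {
  // Returns the indices of a subset of `sizes` whose sum is maximal but <= target.
  static List<Integer> bestSubset(List<Long> sizes, long target) {
    int t = (int) target;
    // best.get(s) holds file indices summing exactly to s, or null if unreachable
    List<List<Integer>> best = new ArrayList<>();
    for (int s = 0; s <= t; s++) {
      best.add(null);
    }
    best.set(0, new ArrayList<>());
    for (int i = 0; i < sizes.size(); i++) {
      int size = sizes.get(i).intValue();
      for (int s = t; s >= size; s--) {  // descending so each file is used at most once
        if (best.get(s) == null && best.get(s - size) != null) {
          List<Integer> picked = new ArrayList<>(best.get(s - size));
          picked.add(i);
          best.set(s, picked);
        }
      }
    }
    for (int s = t; s >= 0; s--) {
      if (best.get(s) != null) {
        return best.get(s);
      }
    }
    return new ArrayList<>();
  }

  public static void main(String[] args) {
    List<Long> remaining = new ArrayList<>(List.of(20L, 20L, 20L, 70L, 100L));  // MB
    while (!remaining.isEmpty()) {
      List<Integer> pick = bestSubset(remaining, 128L);
      List<Long> task = new ArrayList<>();
      for (int i = pick.size() - 1; i >= 0; i--) {  // remove back-to-front so indices stay valid
        task.add(remaining.remove(pick.get(i).intValue()));
      }
      if (task.isEmpty()) {
        task.add(remaining.remove(0));  // a file larger than the target gets its own task
      }
      System.out.println(task);  // prints [100, 20] then [70, 20, 20]
    }
  }
}
```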
We should really get my Medium Files PR in first, which would fix the suggested layout above (if the blocks were the right sizes): https://github.com/apache/iceberg/pull/3292
It seems like a Knapsack problem