External Compaction Progress is inaccurate

Open ddanielr opened this issue 1 year ago • 3 comments

Describe the bug When an external compactor reports progress, it reports the number of entries written as a percentage of the total number of entries.

However, when a compaction job contains bulk import files, the number-of-entries value for those files is 0. The job still counts every entry written, but the total it is compared against omits the bulk-imported entries, so the reported percentage can exceed 100%.

This makes the progress percentage inaccurate.
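To make the arithmetic concrete, here is a minimal sketch of the calculation as described above. It is not Accumulo's actual code: the class name `ProgressSketch`, the method `percent`, and the choice to return 0 when no estimate exists are all assumptions made for illustration.

```java
import java.util.List;

/** Hypothetical model of the progress calculation described in this issue. */
final class ProgressSketch {

  /**
   * Returns entries written as a percentage of the summed per-file entry
   * estimates. Bulk-imported files are modelled as reporting an estimate of 0.
   */
  static double percent(long entriesWritten, List<Long> estimatedEntriesPerFile) {
    long total = 0;
    for (long estimate : estimatedEntriesPerFile) {
      total += estimate;
    }
    if (total == 0) {
      return 0.0; // assumption: report 0% when there is no estimate at all
    }
    return 100.0 * entriesWritten / total;
  }
}
```

With two files estimated at 1000 entries each and one bulk-imported file reporting 0, `ProgressSketch.percent(2500, List.of(1000L, 1000L, 0L))` evaluates to 125.0, even though the job may still be running.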

Versions (OS, Maven, Java, and others, as appropriate):

  • Affected version(s) of this project: 2.1
  • OS: [e.g. CentOS 7.5]
  • Others:

To Reproduce Steps to reproduce the behavior (or a link to an example repository that reproduces the problem):

  1. Trigger an external compaction job against a table that contains files which were bulk imported
  2. Review the compaction-coordinator log (or the monitor's external compactions#Running Compactions page) to see percentages greater than 100% being reported while the job status is still "In Progress"

Expected behavior The compaction progress percentage should be accurate and never report progress greater than 100%.

Additional context The number of estimated entries is coming from Bulk.FileInfo https://github.com/apache/accumulo/blob/33894e69979afc70efca448ea31fb29ac73288f3/core/src/main/java/org/apache/accumulo/core/clientImpl/bulk/Bulk.java#L107

It's likely that fixing the progress bar requires a change to the bulk import code so that it correctly sets the number of estimated entries. If the bulk import code always sets that value to 0, then having it provides little benefit.

ddanielr avatar Mar 27 '24 14:03 ddanielr

This could also be solved by excluding the entries written from the progress bar if they come from bulk import files.
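A rough sketch of that exclusion, assuming the compactor could attribute consumed entries to individual input files. That per-file accounting is hypothetical and may not exist in Accumulo; the names `BulkAwareProgress` and `InputFile` are invented for illustration, and an estimate of 0 is used as the only signal that a file was bulk imported.

```java
import java.util.List;

/** Hypothetical progress calculation that ignores files without an estimate. */
final class BulkAwareProgress {

  /** Hypothetical per-input-file counters. */
  static final class InputFile {
    final long estimatedEntries;
    final long entriesConsumed;

    InputFile(long estimatedEntries, long entriesConsumed) {
      this.estimatedEntries = estimatedEntries;
      this.entriesConsumed = entriesConsumed;
    }
  }

  static double percent(List<InputFile> inputs) {
    long estimated = 0;
    long consumed = 0;
    for (InputFile f : inputs) {
      if (f.estimatedEntries == 0) {
        continue; // assumed bulk imported: leave it out of both sides of the ratio
      }
      estimated += f.estimatedEntries;
      consumed += f.entriesConsumed;
    }
    if (estimated == 0) {
      return 0.0; // assumption: nothing measurable, so report 0%
    }
    // Clamp so that a stale estimate can never push the value past 100%.
    return Math.min(100.0, 100.0 * consumed / estimated);
  }
}
```

The trade-off is that the percentage then describes only the non-bulk inputs, so a job dominated by bulk-imported files could show 100% well before it finishes.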

ddanielr avatar Mar 27 '24 16:03 ddanielr

I can work on this

kevinrr888 avatar Mar 29 '24 18:03 kevinrr888

When the estimated entries value is zero, the compactor process could open the RFile index and sum up the entries for the range covered by the compactor. I think this would be a minimal change for 2.1.x. Modifying bulk import to compute the estimated entries would probably be a much larger change for 2.1.x.
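The shape of that fallback might look like the sketch below. The index read itself is deliberately left abstract: `indexEntryCounter` is a hypothetical callback standing in for whatever code would open a file's RFile index and sum the entries in the compaction range, and `EstimateFallback` is an invented name, not an Accumulo class.

```java
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical helper that fills in missing entry estimates from the file index. */
final class EstimateFallback {

  /**
   * Sums the recorded estimates, consulting the (hypothetical) index counter
   * only for files whose recorded estimate is 0.
   */
  static long totalEstimatedEntries(Map<String, Long> recordedEstimates,
      ToLongFunction<String> indexEntryCounter) {
    long total = 0;
    for (Map.Entry<String, Long> e : recordedEstimates.entrySet()) {
      long estimate = e.getValue();
      if (estimate == 0) {
        // Fallback: ask the index how many entries the file holds in range.
        estimate = indexEntryCounter.applyAsLong(e.getKey());
      }
      total += estimate;
    }
    return total;
  }
}
```

Because the callback runs only for files with a zero estimate, files that already carry an estimate would incur no extra index reads under this approach.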

keith-turner avatar Apr 01 '24 17:04 keith-turner