
Provide ability to track a bulk imported file

Open ivakegg opened this issue 4 years ago • 4 comments

This is similar to #650 but different enough that I thought it warranted a separate ticket. This is related to the 1.x versions.

Basically, the problem is being able to verify with certainty that a bulk imported file was successfully loaded into the system. This requires knowing what the file is renamed to during the bulk import process. Given that information, we could scan the accumulo.metadata table to find its matching entry. We realize there is a race condition here: the GC could have removed it before verification takes place. That situation could be handled by looking in the GC logs, which is not very clean but doable. We could of course also monitor the master log to determine the file mapping, but I was hoping for a cleaner solution.

One possibility is to include the name of the original file in the key or value within the file column family of the accumulo.metadata table. Another possibility is to have the master pass the list of file name mappings back to the client. The latter could be achieved by writing a mapping file into the directory that was being imported or, alternatively, into the failure directory.
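To make the mapping-file idea concrete, here is a rough sketch of what writing such a file could look like. It is only an illustration under assumptions: the class name, the `renames.json` file name, the flat original-to-renamed JSON layout, and the sample file names are all hypothetical and not taken from Accumulo. It uses only the JDK so it can run standalone.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class BulkImportMappingSketch {

  // Hypothetical file name, chosen for this example only.
  static final String MAPPING_FILE = "renames.json";

  // Escape characters that may not appear raw inside a JSON string.
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      if (c == '"' || c == '\\') {
        sb.append('\\').append(c);
      } else if (c < 0x20) {
        sb.append(String.format("\\u%04x", (int) c));
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  // Encode the original -> renamed mapping as a single flat JSON object.
  static String toJson(Map<String,String> renames) {
    StringJoiner json = new StringJoiner(",", "{", "}");
    for (Map.Entry<String,String> e : renames.entrySet()) {
      json.add("\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"");
    }
    return json.toString();
  }

  // Write the mapping into the given directory (for example the import
  // directory or the failure directory) and return the path written.
  static Path write(Path dir, Map<String,String> renames) throws IOException {
    Path out = dir.resolve(MAPPING_FILE);
    Files.write(out, toJson(renames).getBytes(StandardCharsets.UTF_8));
    return out;
  }

  public static void main(String[] args) throws IOException {
    Map<String,String> renames = new LinkedHashMap<>();
    renames.put("part-00000.rf", "I0000abc.rf"); // made-up names
    Path dir = Files.createTempDirectory("bulk");
    Path out = write(dir, renames);
    System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
  }
}
```

A client could read this file back after the import and look up each renamed file in the file column family of accumulo.metadata, subject to the GC race described above.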

ivakegg avatar Mar 10 '21 19:03 ivakegg

Perhaps a trace mechanism akin to a scan trace is doable, one that lets us trace the life of a file through the Accumulo system. That could include everything from the initial mapping through to deletion. This would be a large undertaking, but perhaps worth the effort if we can minimize the performance impact.

ivakegg avatar Mar 10 '21 19:03 ivakegg

Created an initial pull request, #1964, which shows the generation of a simple JSON-encoded mapping file.

ivakegg avatar Mar 10 '21 21:03 ivakegg

I would like this ticket to stand as something to consider: creating a mechanism akin to trace that allows us to track a file through the system, from bulk import through compaction to garbage collection.

ivakegg avatar Mar 24 '21 12:03 ivakegg

#1480 added a TabletLogger class to centralize management of logging events for tablet activity. A similar logger utility class could be used for centralized tracking of log messages pertaining to file activity.
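As a rough illustration of that idea, a file-activity logger might look something like the sketch below. Everything in it is an assumption made for the example: the class name, the logger name `accumulo.files`, the method names, and the message formats are hypothetical and do not mirror TabletLogger's actual API. It uses `java.util.logging` only so that it runs without extra dependencies.

```java
import java.util.logging.Logger;

public class FileLoggerSketch {

  // Hypothetical dedicated logger, so file events can be routed or
  // filtered separately from other log output.
  private static final Logger log = Logger.getLogger("accumulo.files");

  // Each method formats one kind of file event, logs it, and returns
  // the message so callers and tests can inspect it.
  static String bulkImported(String tablet, String original, String renamed) {
    String msg = String.format("Bulk imported %s as %s into tablet %s",
        original, renamed, tablet);
    log.fine(msg);
    return msg;
  }

  static String compacted(String tablet, String input, String output) {
    String msg = String.format("Compacted %s into %s for tablet %s",
        input, output, tablet);
    log.fine(msg);
    return msg;
  }

  static String deleted(String file) {
    String msg = String.format("Garbage collected %s", file);
    log.fine(msg);
    return msg;
  }

  public static void main(String[] args) {
    // Made-up tablet and file names, following one file through its life.
    System.out.println(bulkImported("2;m", "part-00000.rf", "I0000abc.rf"));
    System.out.println(compacted("2;m", "I0000abc.rf", "C0000abd.rf"));
    System.out.println(deleted("I0000abc.rf"));
  }
}
```

Keeping all file events behind one class with a consistent message format would make it possible to grep a single logger for a file name and reconstruct its history.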

ctubbsii avatar Mar 25 '21 23:03 ctubbsii