
Additional improvements to the du command

Open EdColeman opened this issue 2 years ago • 6 comments

Is your feature request related to a problem? Please describe. As a follow-on to https://github.com/apache/accumulo/pull/1259, there may be ways to improve the performance of the du command, possibly at the sacrifice of some accuracy.

Describe the solution you'd like Ideally the du command could return nearly as fast as the hadoop directory usage command.

Additional context There may be a conflict between how much space an Accumulo table needs vs. how much space it is occupying on disk. Compactions in progress can increase the hdfs usage because the new files are created and stored on disk but are not yet part of the table's "used" files. Files that are eligible for gc but not yet deleted also inflate the hdfs space.

The metadata stores estimated entity counts and file sizes that are updated on compaction. For bulk imports, the information may not be available.

For a large table, does it matter? You have a big number vs. another bigger number. In most cases, the physical hdfs usage seems the most relevant. If the Accumulo size is the driving factor and an accurate number is needed, then you probably should run a compaction so that old data can be aged off, shared files consolidated, etc. At that point (barring additional bulk imports) the metadata values should be accurate and may be good enough.

The du command accounts for shared files. A clone is a metadata-only operation in which the file references in the metadata are "copied", so you cannot simply use the file sizes in hdfs under the table directory: the table could be sharing files that live in a different table's directory tree.

Possible changes:

  • if no files are shared, just use the hdfs (hdfs -dus option) directory size of the table (see the sketch after this list).
  • Maybe use the file entity / size estimates and then add the file sizes of bulk import files?
  • provide options to the command to get just the entity / size estimates from the metadata or just the hdfs directory size, and allow the user to figure out what they needed in the first place. (or show both and wish them luck...)
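For reference, a minimal sketch of the first option above using the Hadoop FileSystem API; a single getContentSummary call returns the total length under the directory, which is essentially what the hdfs -dus option reports. The namenode URL and table id path below are placeholders, and as noted above this number is only trustworthy when the table shares no files with other tables.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TableDirSize {
  public static void main(String[] args) throws Exception {
    // Placeholder path; the real instance volume and table id would come from
    // the Accumulo configuration and the table id map.
    Path tableDir = new Path("hdfs://namenode:8020/accumulo/tables/1");
    try (FileSystem fs = tableDir.getFileSystem(new Configuration())) {
      // One NameNode call instead of walking every file, but it counts
      // everything under the directory, shared or not.
      ContentSummary summary = fs.getContentSummary(tableDir);
      System.out.println("bytes under " + tableDir + ": " + summary.getLength());
    }
  }
}
```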

EdColeman avatar Jul 18 '22 16:07 EdColeman

@EdColeman - I agree that making some changes here to improve performance makes a lot of sense. I have a background in Apache Kafka, and it's similar there when trying to compute the message count and disk usage/size of a Topic and its partitions (similar to tablets in Accumulo). It's an intensive process to walk the partitions and figure out the number of messages/disk usage, so estimating and caching sizes/metadata is usually sufficient because the exact count/bytes don't matter much when talking about large enough numbers, as long as it's close.

I am happy to work on this issue and make the changes once we get some feedback from everyone and come to a consensus about what changes should be done here.

@milleruntime, @ctubbsii, @dlmarion - Thoughts on this?

cshannon avatar Jul 18 '22 17:07 cshannon

If no one has any objections, I am volunteering to go ahead and start with the 3rd suggestion listed and work on an option to scan the metadata for the size information, as that seems like a logical thing to try. I can provide a PR for that and see what people think. I may have a little bit of time later this week to get started, but I am out next week, so I probably won't have something to submit for review until at least the first week of August.

cshannon avatar Jul 18 '22 18:07 cshannon

@EdColeman -

Yesterday/today I spent a good amount of time diving into the Scan API and its implementation between the client and server to get a better feel for how that works, and then I also started working on this a bit. I have a branch with a rough prototype/proof of concept that is a work in progress here: https://github.com/cshannon/accumulo/commits/accumulo-2820

It's not ready for a real review yet as there's more work to be done, but you can take a look if you get a chance and see the direction I'm going. I had a couple of questions/comments and wanted to get your thoughts.

  1. The metadata table scan could technically just be done by the client without an RPC call, but I kept the current way of sending an RPC request and letting the server do it inside the TableDiskUsage class. I think this is much better as it keeps the current design intact and is a simpler update; plus, this utility already scans metadata for the file names to use for the HDFS iterator, so it can simply be updated to read the sizes from metadata instead, and the client/shell code can more or less work the same without many modifications.
  2. I created a new disk usage RPC call which is the same as the old one but with a new method parameter. This will allow passing any options we want to customize the du command when it is sent to the server for processing. The main thing now is a Mode enum which currently just has FILE, DIRECTORY, METADATA (a rough sketch of the idea follows this list). The idea is the user running the command could specify how they want to compute the size. The documentation will describe the benefits/drawbacks of each mode. FILE is just the current default way of scanning the HDFS files, DIRECTORY would be for using the hdfs -dus command (not implemented yet in my prototype), and METADATA would be just scanning the metadata table. Having the options parameter and an enum for the mode will allow us to easily expand in the future with any flags or settings we want for computing usage.
  3. I still need to update things to handle scanning the root table if someone wants to know the metadata table size itself.
  4. I haven't looked at bulk import stuff yet but that could be another mode or just be included automatically, not sure.
  5. Tests of course will still need to be updated and done.
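For illustration only, here is a sketch of what the mode option described in item 2 might look like; the names below are made up for this example and are not taken from the linked prototype branch:

```java
// Hypothetical sketch of a per-request mode for the du command.
public enum DiskUsageMode {
  FILE,      // current behavior: resolve files from metadata, then stat them in HDFS
  DIRECTORY, // use the HDFS content summary of the table's directory (like hdfs -dus)
  METADATA   // sum the size estimates already stored in the metadata table
}
```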

cshannon avatar Aug 13 '22 14:08 cshannon

After talking offline with @EdColeman, I am going to try a different approach of creating a new command called du_meta that will be simpler: it will just do the metadata scan on the client side and not touch the RPC call or the current command. The metadata stores the HDFS file sizes, and doing a client-side scan seems like a faster option for getting the output than having to send an RPC call. The help for the command can mention that for the most accurate info a flush/compaction should be run so the metadata is updated.

The old command can stay for now and eventually be marked as deprecated and removed if this new one performs better as expected. I also plan to add some flags to support different options, such as computing the size of a single table or all tables in a namespace, etc.
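A minimal sketch of the client-side metadata scan idea, assuming the 2.x client API; the properties file, table name, and the simplified view of the metadata row/value layout shown in the comments are assumptions for this example, not code from the working branch:

```java
import java.util.Map;

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class MetadataDiskUsage {
  public static void main(String[] args) throws Exception {
    // "client.properties" and "mytable" are placeholders.
    try (AccumuloClient client = Accumulo.newClient().from("client.properties").build()) {
      String tableId = client.tableOperations().tableIdMap().get("mytable");
      long estimatedBytes = 0;
      try (Scanner scan = client.createScanner("accumulo.metadata", Authorizations.EMPTY)) {
        // Tablet rows for a table are "<tableId>;<endRow>", plus "<tableId><" for the last tablet.
        scan.setRange(new Range(tableId + ";", tableId + "<"));
        scan.fetchColumnFamily(new Text("file"));
        for (Map.Entry<Key,Value> entry : scan) {
          // The file column value encodes "estimatedSize,estimatedEntries[,time]".
          estimatedBytes += Long.parseLong(entry.getValue().toString().split(",")[0]);
        }
      }
      System.out.println("estimated bytes for mytable: " + estimatedBytes);
    }
  }
}
```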

cshannon avatar Aug 19 '22 15:08 cshannon

@EdColeman - new working branch for the client side only version is here if you want to take a look: https://github.com/cshannon/accumulo/commits/accumulo-2820-client

This is by no means finished and is still very much a work in progress; I will continue with it later this week, probably Friday. The output still needs work as some stuff is missing, and it needs cleaning up and refinement. I did add a verbose flag to spit out the breakdown when data is shared across tables, just like the original command does when using a pattern or namespace. Besides code cleanup and fixing up the output (maybe with more flags to customize it), there are still no tests yet since I'm still experimenting, so hopefully I can get to that when I am back to working on it at the end of the week and get a PR (or at least a draft PR) ready for review.

Also a note about shared file linking/detection:

Right now, when linking shared files, only one scan is done, so like the original command, shared files are only going to be detected if the table provided or looked at is referencing files from another table and doesn't "own" them. So let's say you have table1 and table2, where table2 is a clone and they share table1's file. If you currently ran du on table1, there would be no indication that there was a shared file, as the only files that come back are owned by table1. But if you ran du on table2, the file that comes back points to table1, so you know it's shared. If we want to be able to show shared files between tables in any direction, no matter which table you query, then we'd need to scan again for tables that match each file.

With the same example, you'd need to scan for table1 and get back the list of files that it owns, and then scan again with those files to see if any other tables reference them. This is not a problem if we use a pattern (regex) and the pattern already matches both tables, because when we scan we hit both tables and know they are associated. This is why the output of the original du command looks a bit different depending on whether you give it a single table or a regex.

I think we are ok with this behavior, as the old command works the same way, and adding extra scans would add complexity and a performance hit that I don't think is necessary. Ultimately it really depends on how you want to define "shared". One way to look at it: in the example above, table1 is not using shared files from table2 since it owns the files; only table2 is using shared files from table1.
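To make the table1/table2 example concrete, here is a toy sketch (with hard-coded file references standing in for a metadata scan) of the "group each file by the tables that reference it" approach that detecting sharing in both directions would require:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// table2 is a clone and still references a file that lives under table1's
// directory; grouping every file by its referencing tables makes the sharing
// visible no matter which table you start from.
public class SharedFileGrouping {
  public static void main(String[] args) {
    // file path -> tables referencing it (would normally come from metadata scans)
    Map<String,Set<String>> refs = new TreeMap<>();
    refs.put("/tables/1/t-001/F001.rf", new TreeSet<>(List.of("table1", "table2")));
    refs.put("/tables/2/t-001/F002.rf", new TreeSet<>(List.of("table2")));

    refs.forEach((file, tables) -> {
      if (tables.size() > 1) {
        System.out.println(file + " is shared by " + tables);
      } else {
        System.out.println(file + " is used only by " + tables);
      }
    });
  }
}
```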

cshannon avatar Aug 20 '22 16:08 cshannon

See my comments on the PR.

ctubbsii avatar Sep 01 '22 23:09 ctubbsii