NNAnalytics
Use block size from HDFS configuration for Large Files calculation
In NNA today, particularly if you look around here: https://github.com/paypal/NNAnalytics/blob/b17e8e6d91fd853b23f67a0b3ed0c5c95c2d8788/src/main/java/org/apache/hadoop/hdfs/server/namenode/cache/SuggestionsEngine.java#L161-L165
you will see that NNA uses a hardcoded cutoff of 128 megabytes to distinguish between "Medium Files" and "Large Files".
We should instead use the byte count from the dfs.blocksize value found in hdfs-site.xml, i.e. the Configuration object passed into NNA programmatically from the source cluster.
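Something like the minimal sketch below, assuming the standard Hadoop Configuration and DFSConfigKeys APIs; the class and method names here are just for illustration, not existing NNA code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;

public class BlockSizeCutoff {

  /**
   * Derive the "Large Files" cutoff from the dfs.blocksize value in the
   * Configuration NNA loaded from the source cluster's hdfs-site.xml.
   * Falls back to the HDFS default (128 MB) if the key is absent.
   */
  static long largeFileCutoffBytes(Configuration conf) {
    // getLongBytes also understands suffixed values such as "128m".
    return conf.getLongBytes(
        DFSConfigKeys.DFS_BLOCK_SIZE_KEY,
        DFSConfigKeys.DFS_BLOCK_SIZE_DEFAULT);
  }
}
```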
What about making the file size definitions user configurable here? It's reasonable to expect differing opinions from users on what constitutes a particular file size. Currently the file sizes are defined as:

tiny: > 0 and <= 1024
small: > 1024 and <= 1048576
medium: > 1048576 and <= 134217728
large: > 134217728

There would be value in being able to examine the farther end of the scale a bit more granularly. Importing tables from an RDBMS can result in files tens or hundreds of GBs in size, for example.
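For reference, those boundaries amount to a classification roughly like this (names are illustrative only, not NNA's actual code; zero-length files fall outside all four buckets per the definitions above):

```java
/** Illustrative helper only; these names are not from the NNA codebase. */
enum FileSizeClass { TINY, SMALL, MEDIUM, LARGE }

final class FileSizeClassifier {
  // Current hardcoded boundaries in bytes: 1 KB, 1 MB, 128 MB.
  private static final long TINY_MAX = 1024L;
  private static final long SMALL_MAX = 1024L * 1024L;
  private static final long MEDIUM_MAX = 128L * 1024L * 1024L;

  /** Assumes fileLengthBytes > 0. */
  static FileSizeClass classify(long fileLengthBytes) {
    if (fileLengthBytes <= TINY_MAX) {
      return FileSizeClass.TINY;
    } else if (fileLengthBytes <= SMALL_MAX) {
      return FileSizeClass.SMALL;
    } else if (fileLengthBytes <= MEDIUM_MAX) {
      return FileSizeClass.MEDIUM;
    }
    return FileSizeClass.LARGE;
  }
}
```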
Good idea. Maybe a web UI where the user can select different filters to sort/group the files would be a better interface.
Hmm, yes, a good idea @americanyorkie -- something to keep in mind though is that those are cached results, so while it is possible to change them, the change may not be reflected until the next SuggestionsEngine run.
Still though, this is probably fine.
I can see an admin-only REST endpoint that would set these. For example, a naive one like: /setAttributes?tiny=1024&small=1048576&medium=134217728
(assuming then that large is anything greater than 134217728).
Thoughts?
I don't think it will be possible to have different settings per user though... We could certainly add a "gigantic file" category too. 😆
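Sketching how that might look internally (purely hypothetical, none of these names exist in NNA today): the endpoint would just write into a shared holder that the SuggestionsEngine reads on its next pass, which lines up with the caching caveat above.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical holder for thresholds set via /setAttributes; none of these
 * names exist in NNA today. The SuggestionsEngine would read the current
 * values on its next run, so changes only show up after that run completes.
 */
final class FileSizeThresholds {
  final AtomicLong tinyMaxBytes = new AtomicLong(1024L);
  final AtomicLong smallMaxBytes = new AtomicLong(1048576L);
  final AtomicLong mediumMaxBytes = new AtomicLong(134217728L);

  /** Apply values from e.g. /setAttributes?tiny=1024&small=1048576&medium=134217728 */
  void update(Long tiny, Long small, Long medium) {
    if (tiny != null) tinyMaxBytes.set(tiny);
    if (small != null) smallMaxBytes.set(small);
    if (medium != null) mediumMaxBytes.set(medium);
    // Anything above mediumMaxBytes is implicitly "large".
  }
}
```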
I still think this is best fetched from the HDFS configuration file (hdfs-site.xml), as that should be the same value used by the active NameNode. If a different value is desired, it can be changed in just the NNA host's hdfs-site.xml.
Changing this value on the fly will not be good for NNA, so it needs to be a hard value decided at bootstrap time.
An additional justification is that once NNA bootstraps from a cluster NameNode (Observer or Standby), it will already have the expected configuration anyway.
More thoughts on this one -- I think we should define a measure by which we classify tiny, small, and medium files.
I think ratios are probably the best measure here. If we were to retain the same hardcoded values, then assuming hdfs-site.xml has a blocksize of 128 MB:
Large files = Greater than blocksize
Medium files = Greater than or equal to 1/128 of blocksize but less than large files
Small files = Greater than or equal to 1/131072 of blocksize but less than medium files
Tiny files = Greater than 0 bytes but less than small files
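Concretely, with those ratios the cutoffs would fall out of the configured blocksize like this (sketch only, not NNA code; at a 128 MB blocksize it reproduces the current 1 KB / 1 MB / 128 MB boundaries):

```java
/** Illustrative only: derive cutoffs as fixed ratios of the configured block size. */
final class RatioThresholds {
  static long tinyMax(long blockSizeBytes)   { return blockSizeBytes / 131072; } // 1 KB at 128 MB
  static long smallMax(long blockSizeBytes)  { return blockSizeBytes / 128; }    // 1 MB at 128 MB
  static long mediumMax(long blockSizeBytes) { return blockSizeBytes; }          // above this is "large"
}
```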
The ratios aren't very intuitive however. Might be better to stick with the hardcoded 1KB and 1MB sizes. Just dumping thoughts.