NNAnalytics
Use block size from HDFS configuration for Large Files calculation
In NNA today, particularly if you look around here: https://github.com/paypal/NNAnalytics/blob/b17e8e6d91fd853b23f67a0b3ed0c5c95c2d8788/src/main/java/org/apache/hadoop/hdfs/server/namenode/cache/SuggestionsEngine.java#L161-L165
you will see that NNA uses a hardcoded cutoff of 128 megabytes to distinguish between "Medium Files" and "Large Files".
We should instead use the byte count from the dfs.blocksize value found in hdfs-site.xml, i.e. the Configuration object passed into NNA programmatically from the source cluster.
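Something like the minimal sketch below, assuming the standard Hadoop Configuration and DFSConfigKeys APIs; the class and method names here are just for illustration, not existing NNA code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;

public class BlockSizeCutoff {

  /**
   * Derive the "Large Files" cutoff from the dfs.blocksize value in the
   * Configuration NNA loaded from the source cluster's hdfs-site.xml.
   * Falls back to the HDFS default (128 MB) if the key is absent.
   */
  static long largeFileCutoffBytes(Configuration conf) {
    // getLongBytes also understands suffixed values such as "128m".
    return conf.getLongBytes(
        DFSConfigKeys.DFS_BLOCK_SIZE_KEY,
        DFSConfigKeys.DFS_BLOCK_SIZE_DEFAULT);
  }
}
```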
What about making the file size definitions user configurable here? It's reasonable to expect differing opinions from users on what constitutes a particular file size. Currently the file sizes are defined as:

tiny: > 0 and <= 1024
small: > 1024 and <= 1048576
medium: > 1048576 and <= 134217728
large: > 134217728

There would be value in being able to examine the farther end of the scale a bit more granularly. Importing tables from an RDBMS can result in files tens or hundreds of GBs in size, for example.
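For reference, those boundaries amount to a classification roughly like this (names are illustrative only, not NNA's actual code; zero-length files fall outside all four buckets per the definitions above):

```java
/** Illustrative helper only; these names are not from the NNA codebase. */
enum FileSizeClass { TINY, SMALL, MEDIUM, LARGE }

final class FileSizeClassifier {
  // Current hardcoded boundaries in bytes: 1 KB, 1 MB, 128 MB.
  private static final long TINY_MAX = 1024L;
  private static final long SMALL_MAX = 1024L * 1024L;
  private static final long MEDIUM_MAX = 128L * 1024L * 1024L;

  /** Assumes fileLengthBytes > 0. */
  static FileSizeClass classify(long fileLengthBytes) {
    if (fileLengthBytes <= TINY_MAX) {
      return FileSizeClass.TINY;
    } else if (fileLengthBytes <= SMALL_MAX) {
      return FileSizeClass.SMALL;
    } else if (fileLengthBytes <= MEDIUM_MAX) {
      return FileSizeClass.MEDIUM;
    }
    return FileSizeClass.LARGE;
  }
}
```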
Good idea. Maybe a web UI where the user can select different filters to sort/group the files would be a better interface.
Hmm, yes, a good idea @americanyorkie -- something to keep in mind though is that those are cached results, so while it is possible to change them, the change may not be reflected until the next SuggestionsEngine run.
Still though, this is probably fine.
I can see an admin-only REST endpoint that would set these. For example, a naive one like: /setAttributes?tiny=1024&small=1048576&medium=134217728
(assuming then that large is anything greater than 134217728).
Thoughts?
I don't think it will be possible to have different settings per user though... We could certainly add a "gigantic file" category too. 😆
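Sketching how that might look internally (purely hypothetical, none of these names exist in NNA today): the endpoint would just write into a shared holder that the SuggestionsEngine reads on its next pass, which lines up with the caching caveat above.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical holder for thresholds set via /setAttributes; none of these
 * names exist in NNA today. The SuggestionsEngine would read the current
 * values on its next run, so changes only show up after that run completes.
 */
final class FileSizeThresholds {
  final AtomicLong tinyMaxBytes = new AtomicLong(1024L);
  final AtomicLong smallMaxBytes = new AtomicLong(1048576L);
  final AtomicLong mediumMaxBytes = new AtomicLong(134217728L);

  /** Apply values from e.g. /setAttributes?tiny=1024&small=1048576&medium=134217728 */
  void update(Long tiny, Long small, Long medium) {
    if (tiny != null) tinyMaxBytes.set(tiny);
    if (small != null) smallMaxBytes.set(small);
    if (medium != null) mediumMaxBytes.set(medium);
    // Anything above mediumMaxBytes is implicitly "large".
  }
}
```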
I still think this is best fetched from the HDFS configuration file (hdfs-site.xml), as that should be the same value used by the active NameNode. If a different value is desired, it can be changed in just the NNA host's hdfs-site.xml.
Changing this value on the fly will not be good for NNA, so it needs to be a hard value decided at bootstrap time.
An additional justification is that once NNA bootstraps from a cluster NameNode (Observer or Standby), it will already have the expected configuration anyway.
More thoughts on this one -- I think we should define a measure by which we classify tiny, small, and medium files.
I think ratios are probably the best measure here. If we were to retain the same hardcoded values, then assuming hdfs-site.xml has a blocksize of 128 MB:
Large files = Greater than blocksize
Medium files = Greater than or equal to 1/128 of blocksize but less than large files
Small files = Greater than or equal to 1/131072 of blocksize but less than medium files
Tiny files = Greater than 0 bytes but less than small files
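Concretely, with those ratios the cutoffs would fall out of the configured blocksize like this (sketch only, not NNA code; at a 128 MB blocksize it reproduces the current 1 KB / 1 MB / 128 MB boundaries):

```java
/** Illustrative only: derive cutoffs as fixed ratios of the configured block size. */
final class RatioThresholds {
  static long tinyMax(long blockSizeBytes)   { return blockSizeBytes / 131072; } // 1 KB at 128 MB
  static long smallMax(long blockSizeBytes)  { return blockSizeBytes / 128; }    // 1 MB at 128 MB
  static long mediumMax(long blockSizeBytes) { return blockSizeBytes; }          // above this is "large"
}
```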
The ratios aren't very intuitive however. Might be better to stick with the hardcoded 1KB and 1MB sizes. Just dumping thoughts.