Glob listing instead of simple HDFS listing, plus pattern support for the input path
Right now hRaven accepts a plain HDFS path as the input folder and fetches all job history and conf files underneath it. This pull request adds support for specifying a pattern with wildcards (*) and uses the HDFS API's globStatus method to list files, instead of hRaven's recursive listFiles method. This way one can easily shard hRaven's job across different years/months/days.
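For readers unfamiliar with globStatus: it expands shell-style wildcards against the filesystem, so a pattern can select one shard of a yyyy/mm/dd layout in a single call. Below is a minimal local-filesystem sketch in Python that mimics the idea; the directory layout and file names are hypothetical stand-ins for hRaven's partitioned history dirs, and Python's glob plays the role that FileSystem.globStatus would on HDFS.

```python
import glob
import os
import tempfile

# Build a toy yyyy/mm/dd layout (hypothetical, for illustration only).
root = tempfile.mkdtemp()
for day in ("2014/01/01", "2014/01/02", "2014/02/01"):
    os.makedirs(os.path.join(root, day))
    # One stand-in job conf file per day directory.
    open(os.path.join(root, day, "job_0001_conf.xml"), "w").close()

# A wildcard pattern selects just January's shard, much as
# globStatus(new Path("/history/2014/01/*/*_conf.xml")) would on HDFS.
matched = sorted(glob.glob(os.path.join(root, "2014/01/*/*_conf.xml")))
print(len(matched))  # 2 files: one for each January day
```

The point of the change is exactly this: by passing `2014/01/*` rather than the whole root, each hRaven job instance only ever lists and processes its own shard.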
Interesting idea. Doesn't the RM already do this (sharding history files by date)? For Hadoop 1 we had the original directory all in one place (where the history server can read from), then we separately ran JobFilePartitioner to shard the files into a yyyy/mm/dd directory structure. Are you doing a different setup? Can you explain how your history files appear in one place and then get sharded, or how that works for you?
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
Angad Singh does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.