filecrush
Add support for filecrushing on Elastic MapReduce
Work-in-progress PR - do not pull yet
Hi @edwardcapriolo - this is an open pull request to add support for using filecrush on EMR.
There are three main things to fix:
1. Instantiating the right type of FileSystem
2. Fixing the location of tmpDir
   - I think we should be referencing "${hadoop.tmp.dir}" rather than a raw new Path("tmp/crush-" + UUID.randomUUID());
3. Replacing the fs.makeQualified(dir).toUri().getPath() pattern with something that doesn't strip important S3 bucket information

#1 is done, see PR. #2 is doable. #3 is a bit harder - I am working through this for EMR, but might need some help from you to make sure my changes don't break filecrush on standard HDFS.
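To make the third point concrete, here is a minimal sketch of why the toUri().getPath() pattern loses the bucket. It uses only java.net.URI, which is the type Hadoop's Path.toUri() returns, so it runs without Hadoop on the classpath. The class name, method names, bucket and host below are hypothetical and are not taken from the filecrush code; this illustrates the problem, not the change in the PR.

```java
import java.net.URI;

public class QualifiedPathDemo {

    // Mirrors the toUri().getPath() pattern: only the path component is kept,
    // so the scheme and the authority (the S3 bucket) are dropped.
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    // Keeps scheme, authority and path together, so the bucket survives.
    static String fullyQualified(String location) {
        return URI.create(location).normalize().toString();
    }

    public static void main(String[] args) {
        // Hypothetical locations, for illustration only.
        String s3Dir = "s3://example-bucket/logs/2013";
        String hdfsDir = "hdfs://namenode:8020/logs/2013";

        // Both locations collapse to the same path-only string.
        System.out.println(pathOnly(s3Dir));
        System.out.println(pathOnly(hdfsDir));

        // The fully qualified form still says which bucket or cluster is meant.
        System.out.println(fullyQualified(s3Dir));
        System.out.println(fullyQualified(hdfsDir));
    }
}
```

On a single HDFS cluster the path-only form is usually harmless because the default filesystem fills in the rest, which is probably why the pattern went unnoticed. On EMR, where input and output can live in different S3 buckets, the dropped authority matters.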
Hoping this is the start of a collaboration! We're really excited about filecrush here at Snowplow.
It all looks good so far. Just let me know when you want me to merge.
We ended up not using this library. :-) You can merge as-is if you like, or close. I'll delete our fork in a few days.