
Add support for filecrushing on Elastic MapReduce

Open alexanderdean opened this issue 11 years ago • 2 comments

Work-in-progress PR - do not pull yet

Hi @edwardcapriolo - this is an open pull request to add support for using filecrush on EMR.

There are three main things to fix:

  1. Instantiate the right type of `FileSystem`
  2. Fix the location of `tmpDir` - I think we should be referencing `"${hadoop.tmp.dir}"` rather than a raw `new Path("tmp/crush-" + UUID.randomUUID());`
  3. Replace the `fs.makeQualified(dir).toUri().getPath()` pattern with something that doesn't strip important S3 bucket information

#1 is done, see PR. #2 is doable. #3 is a bit harder - I am working through this for EMR, but might need some help from you to make sure my changes don't break filecrush on standard HDFS.
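
To illustrate the third point, here is a minimal sketch using only `java.net.URI`, which is the type that Hadoop's `Path.toUri()` returns. It is not filecrush code: the class name, helper names, and the example bucket and namenode addresses are all hypothetical. It only shows why calling `getPath()` on a qualified URI is harmless when the default filesystem is HDFS but loses the bucket for an S3 location.

```java
import java.net.URI;

public class QualifiedPathDemo {

    // Same effect as the tail of fs.makeQualified(dir).toUri().getPath():
    // only the path component is kept; scheme and authority are dropped.
    static String pathOnly(String qualified) {
        return URI.create(qualified).getPath();
    }

    // The part that getPath() throws away. For an s3:// URI this is the bucket.
    static String authority(String qualified) {
        return URI.create(qualified).getAuthority();
    }

    public static void main(String[] args) {
        // Hypothetical locations, for illustration only.
        String hdfsDir = "hdfs://namenode:8020/data/input";
        String s3Dir = "s3://my-bucket/data/input";

        System.out.println(pathOnly(hdfsDir));  // prints "/data/input"
        System.out.println(pathOnly(s3Dir));    // prints "/data/input"
        System.out.println(authority(s3Dir));   // prints "my-bucket"
    }
}
```

Both inputs collapse to the same `/data/input`, so any later step that resolves that string against the default filesystem can no longer tell which bucket it came from. One possible direction, stated as an assumption rather than what the PR does, is to carry the full qualified URI (or the `Path` itself) through the job instead of the bare path string.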

Hoping this is the start of a collaboration! We're really excited about filecrush here at Snowplow.

alexanderdean, Apr 07 '13 10:04

It all looks good so far. Just let me know when you want me to merge.

edwardcapriolo, Apr 07 '13 18:04

We ended up not using this library in the end. :-) You can merge as-is if you like, or close. I'll delete our fork in a few days.

alexanderdean, Dec 11 '14 14:12