dumbo
Integration with Amazon EMR
Sounds like all that's needed is a new backend that talks to the S3 file system and controls EMR jobflows (via the boto API).
Essential features:
- Read input from and write output to S3.
- Create a new jobflow or reuse an existing one.
- Options to specify the number of instances and their types (e.g. m1.medium).
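To make the essential features a bit more concrete, here is a rough sketch of the settings such a backend might take and how it could choose between creating and reusing a jobflow. Everything named here (`EmrOptions`, `split_s3_uri`, `plan_jobflow`, the default values) is hypothetical and not part of dumbo or boto; the actual jobflow calls to boto are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class EmrOptions:
    """Hypothetical settings for an EMR backend (names are assumptions)."""
    input_uri: str                     # e.g. "s3://my-bucket/input/"
    output_uri: str                    # e.g. "s3://my-bucket/output/"
    jobflow_id: Optional[str] = None   # reuse this jobflow when set
    num_instances: int = 2             # assumed default, not from dumbo
    instance_type: str = "m1.medium"


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); reject anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


def plan_jobflow(opts):
    """Validate the options and decide whether to reuse or create a jobflow."""
    split_s3_uri(opts.input_uri)
    split_s3_uri(opts.output_uri)
    if opts.num_instances < 1:
        raise ValueError("need at least one instance")
    if opts.jobflow_id:
        return {"action": "reuse", "jobflow_id": opts.jobflow_id}
    return {
        "action": "create",
        "num_instances": opts.num_instances,
        "instance_type": opts.instance_type,
    }
```

The returned dict is only a plan; a real backend would translate it into the corresponding boto EMR calls.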
Nice to have:
- Automatic upload of local input files to S3.
- Change the number of worker instances.
- Support for spot instances.
- Resource estimator for future runs (e.g. try with a sample, figure out how long it will take for the full thing).
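For the resource estimator, the simplest thing that could work is a linear extrapolation from a sample run. The sketch below assumes runtime scales linearly with input size on top of a fixed startup overhead, which real jobs may not do (shuffle-heavy jobs in particular); the function name and parameters are made up for illustration.

```python
def estimate_full_runtime(sample_bytes, sample_seconds, full_bytes,
                          fixed_overhead_seconds=0.0):
    """Extrapolate the runtime of a full job from a sample run.

    Assumes (hypothetically) that time = overhead + rate * input size.
    """
    if sample_bytes <= 0:
        raise ValueError("sample_bytes must be positive")
    variable_seconds = sample_seconds - fixed_overhead_seconds
    if variable_seconds < 0:
        raise ValueError("overhead exceeds the measured sample time")
    # Seconds spent per byte of input, excluding the fixed startup cost.
    seconds_per_byte = variable_seconds / sample_bytes
    return fixed_overhead_seconds + seconds_per_byte * full_bytes
```

Running the sample on the same number and type of instances as the full job keeps the estimate meaningful; changing the cluster size between the two runs would break the linearity assumption.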