cascalog
Interactive Cascalog cluster REPL
The new serfn for Cascalog 2.0 works great when the tasks have the same var definitions as the machine submitting the job. This is the case for normal ETL stuff, where you run your queries using "hadoop jar ...".
When you're running queries interactively from a REPL, though, this isn't the case. Any functions you define outside the lexical closure of your custom ops won't be available on the tasks.
We can fix this by making something like this work:
(use 'cascalog.repl)
(bootstrap-repl)
Or even better, just have the "hadoop jar myjar.jar cascalog.Repl" command set everything up automatically.
This should cause Cascalog to capture the source code of anything defined at the REPL, and then execute that source in the appropriate namespaces on the tasks before those tasks load up their operations.
This will make the cluster REPL experience of Cascalog identical to the local experience.
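To make the capture-and-replay idea concrete, here is a minimal sketch in plain Clojure. It is not Cascalog's implementation and uses no Cascalog API: the namespace `repl-capture-sketch` and the names `captured-forms`, `defcaptured`, and `replay-forms!` are all hypothetical, invented for illustration. It assumes each definition is wrapped explicitly in a macro, whereas a real `bootstrap-repl` would presumably hook the REPL itself, and it ignores how the captured source would be shipped to the tasks.

```clojure
(ns repl-capture-sketch)

;; Hypothetical registry: an ordered vector of [namespace-symbol source-string]
;; pairs, one per form defined at the REPL.
(defonce captured-forms (atom []))

(defmacro defcaptured
  "Hypothetical helper: evaluates `form` as usual, and also records its
  printed source together with the namespace it was defined in."
  [form]
  `(do
     (swap! captured-forms conj [(quote ~(ns-name *ns*)) ~(pr-str form)])
     ~form))

(defn replay-forms!
  "Evaluates each captured source string in its original namespace,
  creating that namespace first if it does not exist yet."
  [forms]
  (doseq [[ns-sym src] forms]
    (let [fresh? (nil? (find-ns ns-sym))]
      (binding [*ns* (create-ns ns-sym)]
        ;; A brand-new namespace has no clojure.core mappings, so refer them
        ;; before evaluating forms such as defn.
        (when fresh? (refer 'clojure.core))
        (eval (read-string src))))))

;; Usage: define a function "at the REPL", then simulate a task that lacks it.
(defcaptured (defn double-it [x] (* 2 x)))
(ns-unmap 'repl-capture-sketch 'double-it) ; pretend we are on a fresh task JVM
(replay-forms! @captured-forms)            ; the task replays the captured source
((resolve 'repl-capture-sketch/double-it) 5)
;; => 10
```

Storing the source as strings rather than as compiled functions is what lets the definitions travel to a JVM that never saw the REPL session. The ordering of the vector matters, since later forms may depend on earlier ones.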