job-level settings not being passed on to jobs
(as from the mailing list: https://groups.google.com/forum/#!topic/cascalog-user/Rq_O33VsDyc )
I've come across similar issues where the child-JVM options specified in with-job-conf don't "stick". I hit GC issues in a reducer of one of my Cascalog jobs for the first time last week. I found the with-job-conf macro and wrapped the query-execution form with it, to no avail:
(let [snk-qry-by-chan (for [chan channels]
                        (channel-query chan))
      all-snk-qry-seq (apply concat snk-qry-by-chan)]
  ;; configure the MapReduce child JVM options to avoid GC Overhead Limit err
  (with-job-conf {"mapred.child.java.opts" "-XX:-UseGCOverheadLimit -Xmx4g"}
    ;; execute all of the queries in parallel
    (apply ?- all-snk-qry-seq)))
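As a sanity check (assuming with-job-conf works by binding cascalog.conf/*JOB-CONF*, which is my reading of the 1.10.1 source rather than something confirmed), something like this shows the override is at least set in-process, which leaves open the question of whether it ever reaches the cluster-side JobConf:

(require '[cascalog.conf :as conf])  ; assuming this namespace holds *JOB-CONF* in 1.10.1

;; Inside the binding the override should be visible in-process,
;; even though the reducer logs suggest it never reaches the child JVMs.
(with-job-conf {"mapred.child.java.opts" "-XX:-UseGCOverheadLimit -Xmx4g"}
  (println conf/*JOB-CONF*))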
The relevant parts of my project.clj:

  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cascalog "1.10.1"]
                 [incanter "1.4.1"]]
  :repositories {"cloudera" "https://repository.cloudera.com/artifactory/cloudera-repos"}
  :profiles {:provided {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-cdh3u5"]]}}
But in the logging output from the reducer in question, regardless of what I specified in with-job-conf, I always saw this:
2013-07-12 17:25:55,216 INFO cascading.flow.hadoop.FlowMapper: child jvm opts: -Xmx1073741824
Further details:
- We're running a Cloudera distribution of Hadoop (v 4.1.4); the underlying Hadoop version is 2.0.0.
- I'm running Cascalog in cluster mode (I uberjar the code whenever I deploy).
- The exception being thrown from the JVM is a GC Overhead Limit exceeded (as opposed to something like OutOfMemoryError).
- (new detail as of 7/18/13) I've noticed that with-job-conf does pass at least some other jobconf settings. The clearest example: when the with-job-conf map included the key "io.compression.codecs" with a value containing "com.hadoop.compression.lzo.LzopCodec", which does not exist on our installation, I got an error (see the sketch after this list).
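For reference, the shape of that test was roughly the following (the codec list is illustrative; the point is only that the resulting failure proves this particular key did make it into the job configuration):

;; Roughly what I ran: the job failed complaining about the missing
;; LzopCodec class, so at least "io.compression.codecs" was picked up
;; from with-job-conf, while "mapred.child.java.opts" apparently was not.
(with-job-conf {"io.compression.codecs"
                "org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzopCodec"
                "mapred.child.java.opts" "-XX:-UseGCOverheadLimit -Xmx4g"}
  (apply ?- all-snk-qry-seq))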
I saw Robin's workaround, which seems to just modify hadoop-site.xml. It would be great if the with-job-conf settings "stuck", so that per-job needs didn't require tweaking site-wide settings (especially since I don't manage the Hadoop cluster).
I've noticed (perhaps?) related issues in pure Cascading: configuration properties supplied to the FlowConnector don't always get passed into the JobConf, and the behaviour seems inconsistent and unpredictable. It would be good to have visibility into the JobConf and explicit, guaranteed control over it.
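For comparison, the plain-Cascading equivalent is handing a Properties map to the FlowConnector. A minimal interop sketch (class names are from the Cascading 2.x API that Cascalog 1.10.1 pulls in; whether these properties actually end up in the JobConf is exactly the inconsistency I mean):

(import '(cascading.flow.hadoop HadoopFlowConnector)
        '(java.util Properties))

;; Properties given to the FlowConnector are supposed to seed the JobConf
;; of every job in the resulting flow; in practice some of them appear to be dropped.
(def flow-connector
  (HadoopFlowConnector.
    (doto (Properties.)
      (.setProperty "mapred.child.java.opts"
                    "-XX:-UseGCOverheadLimit -Xmx4g"))))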