sparkling-water
sparkling-water copied to clipboard
Sparkling water (Internal Backend mode) model training stuck at 100% with OOM Error
In Sparkling water internal backend mode, model training got stuck at 100% for long time. Even for small dataset of around 100mb with 1 million records. When i checked the h2o logs, i can see the following logs Config:
-------------------------- ----------------------------------------------------------
2022-05-30T11:14:44.959904202Z H2O_cluster_uptime: 11 secs
2022-05-30T11:14:44.959914185Z H2O_cluster_timezone: Etc/UTC
2022-05-30T11:14:44.959921216Z H2O_data_parsing_timezone: UTC
2022-05-30T11:14:44.959927689Z H2O_cluster_version: 3.36.0.3
2022-05-30T11:14:44.959933843Z H2O_cluster_version_age: 3 months and 13 days !!!
2022-05-30T11:14:44.959939679Z H2O_cluster_name: sparkling-water-185_spark-7c1d24e974f64e2996e8e9caeab93780
2022-05-30T11:14:44.959945175Z H2O_cluster_total_nodes: 1
2022-05-30T11:14:44.959951075Z H2O_cluster_free_memory: 12 Gb
2022-05-30T11:14:44.959956626Z H2O_cluster_total_cores: 4
2022-05-30T11:14:44.960038434Z H2O_cluster_allowed_cores: 4
2022-05-30T11:14:44.960044174Z H2O_cluster_status: locked, healthy
2022-05-30T11:14:44.960050228Z H2O_connection_url: http://main-py-04ef778114ac2e76-driver-svc.spark.svc:54321
2022-05-30T11:14:44.960056131Z H2O_connection_proxy: null
2022-05-30T11:14:44.960061500Z H2O_internal_security: False
2022-05-30T11:14:44.960068528Z Python_version: 3.9.2 final
2022-05-30T11:14:44.960074206Z -------------------------- ----------------------------------------------------------
H2O Logs:
2022-05-30T11:16:51.791569112Z 05-30 11:16:51.627 10.171.128.84:54321 14 client-3-1 WARN org.apache.spark.rpc.netty.NettyRpcEnv: Ignored message: HeartbeatResponse(false)
2022-05-30T11:16:57.389791213Z 05-30 11:16:57.389 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.5 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=23.61 GB OOM!
2022-05-30T11:17:57.457602205Z 05-30 11:17:57.457 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.5 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=23.61 GB OOM!
2022-05-30T11:18:57.470322209Z 05-30 11:18:57.469 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.5 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=23.65 GB OOM!
2022-05-30T11:19:57.561733511Z 05-30 11:19:57.561 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.5 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=23.61 GB OOM!
2022-05-30T11:20:57.628072540Z 05-30 11:20:57.627 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.6 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=23.61 GB OOM!
2022-05-30T11:21:57.670927835Z 05-30 11:21:57.670 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.5 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=11.69 GB OOM!
2022-05-30T11:22:57.768870907Z 05-30 11:22:57.768 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.5 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=11.68 GB OOM!
2022-05-30T11:23:57.822428018Z 05-30 11:23:57.821 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:160.2 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=23.61 GB OOM!
2022-05-30T11:24:57.903366354Z 05-30 11:24:57.902 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.6 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=11.68 GB OOM!
2022-05-30T11:25:57.974622238Z 05-30 11:25:57.973 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.6 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=11.68 GB OOM!
2022-05-30T11:26:58.089590722Z 05-30 11:26:58.089 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.6 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=23.61 GB OOM!
2022-05-30T11:27:30.276656597Z 05-30 11:27:30.275 10.171.128.84:54321 14 client-3-1 WARN org.apache.spark.rpc.netty.NettyRpcEnv: Ignored message: HeartbeatResponse(false)
2022-05-30T11:27:31.153629746Z 05-30 11:27:31.152 10.171.128.84:54321 14 eartbeater WARN org.apache.spark.executor.Executor: Issue communicating with driver in heartbeater
2022-05-30T11:27:31.153717997Z org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
2022-05-30T11:27:31.153726180Z at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
2022-05-30T11:27:31.153733030Z at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
2022-05-30T11:27:31.153739120Z at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
2022-05-30T11:27:31.153744544Z at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
2022-05-30T11:27:31.153749376Z at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
2022-05-30T11:27:31.153754170Z at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
2022-05-30T11:27:31.153758640Z at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:996)
2022-05-30T11:27:31.153763100Z at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
2022-05-30T11:27:31.153767724Z at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
2022-05-30T11:27:31.153772240Z at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
2022-05-30T11:27:31.153776700Z at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
2022-05-30T11:27:31.153781136Z at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
2022-05-30T11:27:31.153785681Z at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
2022-05-30T11:27:31.153790339Z at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
2022-05-30T11:27:31.153795260Z at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
2022-05-30T11:27:31.153799797Z at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2022-05-30T11:27:31.153821732Z at java.base/java.lang.Thread.run(Unknown Source)
2022-05-30T11:27:31.153825938Z Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
2022-05-30T11:27:31.153830402Z at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
2022-05-30T11:27:31.153834613Z at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
2022-05-30T11:27:31.153838500Z at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
2022-05-30T11:27:31.153843534Z at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
2022-05-30T11:27:31.153847861Z ... 12 more
2022-05-30T11:27:58.126288707Z 05-30 11:27:58.125 10.171.128.84:54321 14 FJ-2-1 WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.6 MB + FREE:10.75 GB == MEM_MAX:12.00 GB), desiredKV=23.61 GB OOM!
Any help on this issue? Thanks
@gurumoorthy208524 in internal mode memory is shared with Spark - that message we're seeing indicates that there was an out of memory error catched in H2O-3, not sure what is your setup when it comes to Spark and what exactly you're doing there but that might give you some direction - how much memory the Spark driver and executors have etc.
@krasinski the following is my spark config
conf["spark.driver.cores"]=2
conf["spark.driver.memory"]="2g"
conf["spark.executor.cores"]=4
conf["spark.executor.memory"]="12g"
conf["spark.executor.instances"]=1
@gurumoorthy208524, it's difficult to help you if we don't have a full context. For internal backend, Please always send us the full logs from driver and executors, not only the snippets. The other parts of logs also contains statistics about the dataset in memory and the details about the model that you're trying to train.
@gurumoorthy208524 Several observations from the logs:
- The frame doesn't seem to be parsed correctly, min and max values from the frame statistics:
2022-05-30T11:15:54.534422466Z 05-30 11:15:54.533 10.171.128.84:54321 14 1868849-69 INFO water.default: Parse result for frame_rdd_36795644202 (3644019 rows, 6 columns):
2022-05-30T11:15:54.841176497Z 05-30 11:15:54.840 10.171.128.84:54321 14 1868849-69 INFO water.default: ColV2 type min max mean sigma NAs constant cardinality
2022-05-30T11:15:54.841358567Z 05-30 11:15:54.841 10.171.128.84:54321 14 1868849-69 INFO water.default: repo_name: factor *CYGWIN* zzzzrrr/poly 116782
2022-05-30T11:15:54.841487028Z 05-30 11:15:54.841 10.171.128.84:54321 14 1868849-69 INFO water.default: ref: factor 33188 refs/heads/z 16 1406
2022-05-30T11:15:54.841900831Z 05-30 11:15:54.841 10.171.128.84:54321 14 1868849-69 INFO water.default: path: string 16
2022-05-30T11:15:54.841927474Z 05-30 11:15:54.841 10.171.128.84:54321 14 1868849-69 INFO water.default: mode: factor 33188 g')"") 23 5
2022-05-30T11:15:54.841960139Z 05-30 11:15:54.841 10.171.128.84:54321 14 1868849-69 INFO water.default: id: string 24
2022-05-30T11:15:54.842542829Z 05-30 11:15:54.841 10.171.128.84:54321 14 1868849-69 INFO water.default: symlink_target: factor !<symlink>ÿþ zoom-select. 3615148 13535
- Since the cardinality of the
repo_name
column is high (116782). Algorithms using one-hot encoding will will allocate a really big arrays and this might lead to the warning messages:
WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:1.09 GB + POJO:159.6 MB + FREE:10.75 GB == MEM_MAX:12.00 GB)
Can you try to exclude GLM and DeepLearning from your AutoML configuration?
Can you also share your code so we can try to reproduce?
cc @krasinski @sebhrusen @tomasfryda
closing because of no response for a long time