NVTabular
[BUG] NVTabular runs into OOM or dies when scaling to large datasets
Describe the bug: I tried multiple workflows and ran into different issues when running NVTabular workflows on large datasets on a multi-GPU setup.
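The runs use a multi-GPU Dask cluster. A minimal sketch of that kind of setup (the dask_cuda cluster parameters below are illustrative assumptions, not the exact values used in these runs; depending on the NVTabular version, the client may also need to be passed to the Workflow explicitly):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per visible GPU; capping device memory makes workers spill to host
# memory before hitting a hard CUDA OOM (the limits below are assumptions).
cluster = LocalCUDACluster(
    protocol="tcp",
    device_memory_limit="24GB",  # per-worker spill threshold
    rmm_pool_size="20GB",        # pre-allocated RMM pool per GPU
)
client = Client(cluster)  # the active Dask client is used for distributed execution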
Error 1: Workers die one after another
Characteristics:
- Dataset size: ~200 million rows
- ~200 columns
- some Categorify ops, some min-max normalization ops, some lambda ops (a representative sketch of this kind of workflow is below)
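A representative sketch of such a workflow (the column names, the min-max op, and the lambda body are illustrative assumptions, not the actual columns):

import nvtabular as nvt

# a few categorical, continuous, and lambda-transformed column groups
cat_features = ['user_id', 'item_id'] >> nvt.ops.Categorify()
cont_features = ['price', 'age'] >> nvt.ops.NormalizeMinMax()
lambda_features = ['timestamp'] >> nvt.ops.LambdaOp(lambda col: col % 86400)
features = cat_features + cont_features + lambda_features

The workers then fail one after another with: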
2022-09-15 12:59:09,526 - tornado.application - ERROR - Exception in callback <function Worker.__init__.<locals>.<lambda> at 0x7fc9dd565e50>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921, in _run
val = self.callback()
File "/usr/local/lib/python3.8/dist-packages/distributed/worker.py", line 773, in <lambda>
lambda: self.batched_stream.send({"op": "keep-alive"}), 60000
File "/usr/local/lib/python3.8/dist-packages/distributed/batched.py", line 137, in send
raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:48834 remote=tcp://127.0.0.1:36323> already closed.
2022-09-15 12:59:09,533 - tornado.application - ERROR - Exception in callback <function Worker.__init__.<locals>.<lambda> at 0x7f6c60fe9940>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921, in _run
val = self.callback()
File "/usr/local/lib/python3.8/dist-packages/distributed/worker.py", line 773, in <lambda>
lambda: self.batched_stream.send({"op": "keep-alive"}), 60000
File "/usr/local/lib/python3.8/dist-packages/distributed/batched.py", line 137, in send
raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:48832 remote=tcp://127.0.0.1:36323> already closed.
Error 2: Runs into OOM
Workflow:
import nvtabular as nvt

# joint Categorify over the combined ('col1', 'col2') column group
features1 = (
    [['col1', 'col2']] >>
    nvt.ops.Categorify()
)
# hashed Categorify for a single high-cardinality column
features2 = (
    ['col3'] >>
    nvt.ops.Categorify(
        num_buckets=10_000_000
    )
)
targets = ['target1', 'target2']
features = features1 + features2 + targets
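The workflow is then fit and applied roughly as follows (a sketch; the paths and part_size are assumptions, and the fit step is where the per-column uniques for Categorify are computed):

workflow = nvt.Workflow(features)
dataset = nvt.Dataset("/path/to/input/*.parquet", engine="parquet", part_size="128MB")
workflow.fit(dataset)  # computes the unique values / category mappings
workflow.transform(dataset).to_parquet("/path/to/output")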
Characteristics:
- ~35000 files, ~350GB parquet, 15 billion rows
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
@benfred, please check with @bschifferer on this.
@rjzamora Any idea what could be happening here? I know you've been putting some work into Categorify. I think this is happening during the computation of all unique values, which we may want to allow to be passed into the op as an input, since that is a relatively straightforward piece of information to pull from a data lake.
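For reference, a sketch of what passing pre-computed uniques into the op could look like. Recent NVTabular releases expose a vocabs argument on Categorify; whether it is available (and whether it fully covers this case) depends on the version, so treat this as illustrative:

import cudf
import nvtabular as nvt

# unique values pulled from the data lake ahead of time (hypothetical content)
precomputed = {'col3': cudf.Series(['a', 'b', 'c'])}
features2 = ['col3'] >> nvt.ops.Categorify(vocabs=precomputed)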
> Any idea what could be happening here?
I suppose there are many possibilities, depending on whether the failure happens in the fit or the transform. For example, https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 explains two reasons why the fit could be a problem with the current implementation: the lack of a "proper" tree reduction, and the requirement to write all of the uniques for a given column to disk at once.
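To illustrate the tree-reduction point, here is a toy, pure-Python sketch (not NVTabular code) contrasting a flat reduction, which combines every per-partition result in a single step, with a tree reduction that merges partial results in small groups so peak memory stays bounded:

def flat_unique(partitions):
    # concatenate everything before deduplicating -> large peak memory
    combined = []
    for part in partitions:
        combined.extend(part)
    return set(combined)

def tree_unique(partitions, fanout=2):
    # repeatedly merge small groups of partial results -> bounded peak memory
    results = [set(part) for part in partitions]
    while len(results) > 1:
        merged = []
        for i in range(0, len(results), fanout):
            acc = set()
            for partial in results[i:i + fanout]:
                acc |= partial
            merged.append(acc)
        results = merged
    return results[0]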
@bschifferer - I'd like to explore whether https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 (or some variation of it) can help with this. Can you share details about the system you are running on and a representative/toy dataset where you are seeing these issues? (Feel free to contact me offline about the dataset.)
@bschifferer, please update the status of this ticket. Are we working on this dataset now?