
[BUG] NVTabular runs into OOM or dies when scaling to large datasets

Open bschifferer opened this issue 2 years ago • 5 comments

Describe the bug
I tried multiple workflows and ran into different issues when running NVTabular workflows on large datasets on a multi-GPU setup.

Error 1: Workers just die one after another

Characteristics:

  • Dataset size: ~200 million rows
  • ~200 columns
  • a mix of Categorify ops, min/max normalization, and LambdaOp ops (a representative setup is sketched below)
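
For reference, the overall setup looks roughly like this (a minimal sketch: column names, file paths, memory limits, and the LambdaOp body are placeholders, not the real workflow):

import nvtabular as nvt
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# one Dask-CUDA worker per GPU, spilling from device to host memory above the limit
cluster = LocalCUDACluster(device_memory_limit='24GB')
client = Client(cluster)

cat_features = ['cat1', 'cat2'] >> nvt.ops.Categorify()
cont_features = ['num1', 'num2'] >> nvt.ops.NormalizeMinMax()
lambda_features = ['num3'] >> nvt.ops.LambdaOp(lambda col: col.clip(lower=0))

workflow = nvt.Workflow(
    cat_features + cont_features + lambda_features,
    client=client,  # older NVTabular releases take the Dask client explicitly
)

dataset = nvt.Dataset('/data/train/*.parquet', engine='parquet', part_size='128MB')
workflow.fit(dataset)
workflow.transform(dataset).to_parquet('/data/train_processed/')

The workers then die one after another with errors like: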
2022-09-15 12:59:09,526 - tornado.application - ERROR - Exception in callback <function Worker.__init__.<locals>.<lambda> at 0x7fc9dd565e50>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.8/dist-packages/distributed/worker.py", line 773, in <lambda>
    lambda: self.batched_stream.send({"op": "keep-alive"}), 60000
  File "/usr/local/lib/python3.8/dist-packages/distributed/batched.py", line 137, in send
    raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:48834 remote=tcp://127.0.0.1:36323> already closed.
2022-09-15 12:59:09,533 - tornado.application - ERROR - Exception in callback <function Worker.__init__.<locals>.<lambda> at 0x7f6c60fe9940>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.8/dist-packages/distributed/worker.py", line 773, in <lambda>
    lambda: self.batched_stream.send({"op": "keep-alive"}), 60000
  File "/usr/local/lib/python3.8/dist-packages/distributed/batched.py", line 137, in send
    raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:48832 remote=tcp://127.0.0.1:36323> already closed.

Error 2: Runs into OOM

Workflow:

import nvtabular as nvt

# col1 and col2 are passed as a nested list, so Categorify treats them as a column group
features1 = (
    [['col1', 'col2']] >>
    nvt.ops.Categorify()
)

# col3 is hashed into 10 million buckets
features2 = (
    ['col3'] >>
    nvt.ops.Categorify(
        num_buckets=10_000_000
    )
)

targets = ['target1', 'target2']
features = features1 + features2 + targets

Characteristics:

  • ~35,000 files, ~350 GB of parquet data, 15 billion rows
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
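
The fit/transform is driven roughly like this (a sketch; the file paths and part_size are illustrative, not the exact values used):

import glob
import nvtabular as nvt

files = sorted(glob.glob('/data/large_dataset/*.parquet'))  # ~35,000 parquet files
dataset = nvt.Dataset(files, engine='parquet', part_size='256MB')

workflow = nvt.Workflow(features)
workflow.fit(dataset)  # the MemoryError above is raised while running this workflow
workflow.transform(dataset).to_parquet('/data/output/')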

bschifferer avatar Sep 23 '22 13:09 bschifferer

@benfred, please check with @bschifferer on this.

viswa-nvidia avatar Sep 26 '22 23:09 viswa-nvidia

@rjzamora Any idea what could be happening here? I know you've been putting in some work on Categorify. I think this is happening during the computation of all of the uniques, which we may want to allow as an input into the op, since the uniques are a relatively straightforward piece of information to pull from a data lake.
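
Roughly the kind of interface I mean, assuming a vocabs-style argument on Categorify that accepts precomputed uniques (a sketch; the column and file names are placeholders):

import cudf
import nvtabular as nvt

# uniques pulled from the data lake ahead of time, one parquet file per column
col1_uniques = cudf.read_parquet('/metadata/col1_uniques.parquet')['col1']

cat_features = (
    ['col1'] >>
    nvt.ops.Categorify(vocabs={'col1': col1_uniques})
)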

EvenOldridge avatar Oct 17 '22 23:10 EvenOldridge

Any idea what could be happening here?

I suppose there are many possibilities, depending on whether the failure happens during the fit or during the transform. For example, https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 explains two reasons why the fit can be a problem with the current implementation: the lack of a "proper" tree reduction, and the requirement to write all of the uniques for a given column to disk at once.
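
For concreteness, the kind of tree reduction meant there is roughly what drop_duplicates gives you in plain dask_cudf when split_every and split_out are set (a generic illustration of the idea, not the code in that PR):

import dask_cudf

# compute the uniques of a single column with a bounded, multi-level reduction
ddf = dask_cudf.read_parquet('/data/large_dataset/*.parquet', columns=['col3'])

uniques = ddf.drop_duplicates(
    split_every=8,   # combine at most 8 intermediate results per reduction step
    split_out=16,    # keep the final uniques spread over 16 partitions
)
uniques.to_parquet('/data/uniques/col3/')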

rjzamora avatar Oct 18 '22 00:10 rjzamora

@bschifferer - I'd like to explore if https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 (or some variation of it) can help with this. Can you share details about the system you are running on and a representative/toy dataset where you are seeing issues? (feel free to contact me offline about the dataset)

rjzamora avatar Oct 21 '22 17:10 rjzamora

@bschifferer, please update the status of this ticket. Are we working on this dataset now?

viswa-nvidia avatar Apr 11 '23 17:04 viswa-nvidia