NVTabular icon indicating copy to clipboard operation
NVTabular copied to clipboard

[Task] Large Joins

Open bschifferer opened this issue 3 years ago • 1 comments

What needs doing Improve user experience to learn how to do large joins. Customer feedback is that JoinExternal suggest that the operator can be used for joins between two large dataframes. In reality, this can cause (silent) OOM issues.

  1. Define what is the recommended way for large joins (NVTabular, cuDF, dask_cudf, Spark, etc)?
  2. What is the limit of JoinExternal?
  3. Document the limits of JoinExternal and recommended way to execute joins of two large data frames
  4. Update notebooks when JoinExternal is used - why it works and in which cases it wont work

bschifferer avatar Oct 11 '21 13:10 bschifferer

@bschifferer, can you please follow up on this task? We should have good guidance on how to do large joins (NVT, cuDF, dask_cudf, or Spark). These are what I'd suggest. Feel free to share feedback.

  1. Benchmark JoinExternal with a large dataset (start with Criteo, but create a bigger synthetic data if necessary) against cuDF, dask_cudf, and Spark
  2. Record performance and limitations (e.g. NVT raises OOM)
  3. Add guidance to NVTabular doc on how to perform JoinExternal (e.g. use NVT for large joins or use Spark for joins with data exceeding GPU memory)

@EvenOldridge for vis.

jsohn-nvidia avatar Dec 06 '22 16:12 jsohn-nvidia