NVTabular
NVTabular copied to clipboard
[Task] Large Joins
What needs doing
Improve user experience to learn how to do large joins. Customer feedback is that JoinExternal
suggest that the operator can be used for joins between two large dataframes. In reality, this can cause (silent) OOM issues.
- Define what is the recommended way for large joins (NVTabular, cuDF, dask_cudf, Spark, etc)?
- What is the limit of
JoinExternal
? - Document the limits of
JoinExternal
and recommended way to execute joins of two large data frames - Update notebooks when
JoinExternal
is used - why it works and in which cases it wont work
@bschifferer, can you please follow up on this task? We should have good guidance on how to do large joins (NVT, cuDF, dask_cudf, or Spark). These are what I'd suggest. Feel free to share feedback.
- Benchmark JoinExternal with a large dataset (start with Criteo, but create a bigger synthetic data if necessary) against cuDF, dask_cudf, and Spark
- Record performance and limitations (e.g. NVT raises OOM)
- Add guidance to NVTabular doc on how to perform JoinExternal (e.g. use NVT for large joins or use Spark for joins with data exceeding GPU memory)
@EvenOldridge for vis.