Add GPU support to Modin
My name is Xiwen Zhang, and I am a PhD student in CS at Georgia Tech, working with Prof. Alexey Tumanov. Over the last four months, our team has been working on adding GPU support to Modin. We have implemented about 50 APIs, and our preliminary evaluation results are promising. We would like to merge our code into the Modin project. In this issue, I will briefly describe the major design choices we made to add GPU support to Modin, as well as our plan for merging the code.
Design Choices:
- We use Ray as the distributed execution engine and RAPIDS cuDF as the backend library.
- Modin's architecture is well designed: it has a pandas API layer, a query-compiler layer, and an engine layer. We reuse the pandas API layer while implementing our own query compiler and engine.
- The biggest architectural change is in the engine layer. Modin's implementation relies on Ray's remote functions (tasks) and in-memory object store: it saves each dataframe partition into the object store and invokes remote function calls on those partitions. In our case, every partition lives in the device memory of a GPU instead of host memory. Ray does not have an in-GPU object store, so we cannot use the same mechanism Modin does. Instead, we instantiate one Ray actor per GPU and invoke remote actor method calls (a minimal sketch follows this list). Each partition still wraps an object ID, and when invoking remote actor method calls, we pass these object IDs as arguments. The returned object IDs are wrapped in an array of new partitions, which forms a new distributed dataframe object.
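
As a concrete illustration of the actor-per-GPU mechanism above, here is a minimal sketch assuming Ray and cuDF; the `GPUManager` class, its `put`/`apply` methods, and the integer-key store are illustrative assumptions, not Modin's actual implementation.

```python
import pandas
import ray
import cudf

ray.init()

@ray.remote(num_gpus=1)
class GPUManager:
    """One actor is pinned to each GPU and owns the cuDF partitions
    resident in that device's memory, keyed by integer IDs."""

    def __init__(self):
        self.store = {}   # key -> cudf.DataFrame held in GPU memory
        self.counter = 0

    def put(self, pandas_df):
        # Move a host (pandas) partition into device memory; return its key.
        self.counter += 1
        self.store[self.counter] = cudf.DataFrame.from_pandas(pandas_df)
        return self.counter

    def apply(self, key, func):
        # Run a cuDF function on a stored partition, keep the result on the
        # same GPU as a new partition, and return the new partition's key.
        self.counter += 1
        self.store[self.counter] = func(self.store[key])
        return self.counter

# Hypothetical usage: the ObjectRef returned by each remote call plays the
# role of the object ID that a partition wraps.
manager = GPUManager.remote()
key = manager.put.remote(pandas.DataFrame({"a": [1, 2, 3]}))
new_key = manager.apply.remote(key, lambda df: df * 2)  # Ray resolves `key`
```

Keeping data resident inside a long-lived actor avoids a host-device copy on every operation, which remote functions plus the host object store would otherwise require.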
Code Merging Plan (which files will be affected):
- We reuse Modin's pandas API layer and expect no changes to it.
- We will add a new module to the query compiler layer, modin/backends/cudf/.
- Some files in modin/data_management/functions/ may be affected.
- We will add a new module to the engine layer, modin/engines/ray/cudf_on_ray/. Besides data.py, partition_manager.py, partition.py, and axis_partition.py, we will add a new file called gpu_manager.py, in which we define a Ray remote class (actor) for executing cuDF functions on a GPU device (see the partition sketch after this list).
- We will try to reuse the code in modin/engines/base/ as much as we can. At the same time, this means the files in modin/engines/base/ may be affected by this merge.
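
To illustrate how partition.py and gpu_manager.py could fit together, here is a hypothetical partition wrapper; the class name, attributes, and `apply` method are assumptions for illustration, not the actual contents of these files.

```python
class GPUPartition:
    """Pairs a handle to the owning GPU actor with the object ID of the
    cuDF partition held in that actor's device memory."""

    def __init__(self, gpu_manager, key):
        self.gpu_manager = gpu_manager  # GPUManager actor handle (hypothetical)
        self.key = key                  # ObjectRef resolving to a store key

    def apply(self, func):
        # Dispatch `func` to the owning actor; the returned ObjectRef is the
        # object ID of a new partition that stays on the same GPU.
        new_key = self.gpu_manager.apply.remote(self.key, func)
        return GPUPartition(self.gpu_manager, new_key)
```

An array of such partitions would then form the new distributed dataframe object described under Design Choices.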
To make code review easier, we plan to merge incrementally. All comments and feedback will be greatly appreciated. Also, feel free to ask me any questions you may have.
Hello @xiwen1995!
First of all, thank you for wanting to contribute your great work to Modin.
Your architectural decisions for adding a new backend look exactly as we imagined. A quick note: a new backend should start under the modin/experimental folder, so the paths would be modin/experimental/backends/cudf/ and modin/experimental/engines/ray/cudf_on_ray/.
The approach of using a Ray actor per GPU is also of interest, as are the results you were able to obtain. Can you share some of them?
cc @devin-petersohn
Thanks @xiwen1995 and @anmyachev.
I don't think we need to add it to experimental. Like the Dask engine, we can just give a warning. The experimental flag is intended more for things that are still under heavy development and not usable for most people. I believe that the GPU support by @xiwen1995 and team is usable in its current state.
Thanks again @xiwen1995, @kvu35 and others! Let me know how I can help!
Thanks @devin-petersohn and @anmyachev. Yes, the GPU support is usable now, although only a subset of operators is covered.
Is there any instruction available to make use of this awesome feature?
In a pure RAPIDS environment, I do this so that work can spill over to host memory when the GPU is saturated:

```python
import cudf

# Use CUDA managed (unified) memory so allocations can exceed device
# memory, with pages migrating to host RAM when the GPU is saturated.
cudf.set_allocator("managed")
```
@mhoangvslev We are working on the integration now! This functionality is for multi-machine, multi-GPU support.
Surely this would be helpful to use in hosted ML instances!
Any updates on this issue?
@tianlinzx there is some experimental and partial cudf support in the Modin source, but I hear from @prutskov that it's mostly not working. Modin's CI doesn't test whether the cudf support is working. I don't have a cudf setup I can use to manually test either.
AFAIK no regular Modin contributors are trying to improve GPU support right now.
We'll leave this issue as the canonical one for GPU support.
Hi, is anyone still working on this, or is the implementation dead for now?
There is experimental support for Intel GPU through HDK - https://modin.readthedocs.io/en/stable/development/using_hdk.html#running-on-a-gpu, but it is not tested in CI.