cuDF - GPU DataFrame Library
NOTE: For the latest stable README.md, ensure you are on the main branch.
Resources
- cuDF Reference Documentation: Python API reference, tutorials, and topic guides.
- libcudf Reference Documentation: C/C++ CUDA library API reference.
- Getting Started: Instructions for installing cuDF.
- RAPIDS Community: Get help, contribute, and collaborate.
- GitHub repository: Download the cuDF source code.
- Issue tracker: Report issues or request features.
Overview
Built on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
cuDF provides a pandas-like API that will be familiar to data engineers and data scientists, so they can accelerate their workflows without going into the details of CUDA programming.
For example, the following snippet downloads a CSV, then uses the GPU to parse it into rows and columns and run calculations:
import cudf, requests
from io import StringIO
url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')
tips_df = cudf.read_csv(StringIO(content))
tips_df['tip_percentage'] = tips_df['tip'] / tips_df['total_bill'] * 100
# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())
Output:
size
1 21.729201548727808
2 16.571919173482897
3 15.215685473711837
4 14.594900639351332
5 14.149548965142023
6 15.622920072028379
Name: tip_percentage, dtype: float64
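The other operations named in the overview (filtering, joining, aggregating) use the same pandas-style calls. The following is a minimal sketch, not an excerpt from the cuDF documentation: the `orders` and `customers` tables are made-up data, and the `xdf` alias with a fallback to pandas is an assumption added so the snippet also runs on a machine without cuDF or a GPU.

```python
# Assumption: fall back to pandas when cuDF is not installed.
# Every call used below exists in both libraries.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Hypothetical example data, not taken from the cuDF docs.
orders = xdf.DataFrame({
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [20.0, 35.5, 12.5, 50.0, 4.5],
})
customers = xdf.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# Filter: keep only orders of at least 10.0.
large = orders[orders["amount"] >= 10.0]

# Join: attach each customer's region to the remaining orders.
joined = large.merge(customers, on="customer_id", how="inner")

# Aggregate: total order amount per region.
totals = joined.groupby("region")["amount"].sum().sort_index()
print(totals)
```

With cuDF installed, the same lines run on the GPU without any change to the DataFrame code.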
For additional examples, browse our complete API documentation, or check out our more detailed notebooks.
Quick Start
Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you’re running. This provides a ready-to-run Docker container with example notebooks and data, showcasing how you can use cuDF.
Installation
CUDA/GPU requirements
- CUDA 11.0+
- NVIDIA driver 450.80.02+
- Pascal architecture or better (Compute Capability >=6.0)
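The minimums above can be compared against a machine's versions with a small helper. This is an illustrative sketch only: `parse_version` and `meets_requirements` are hypothetical names, not part of cuDF, and the version strings are passed in by hand rather than detected.

```python
import re

# Minimums copied from the requirements list above, padded to three parts.
MIN_CUDA = (11, 0, 0)
MIN_DRIVER = (450, 80, 2)
MIN_COMPUTE_CAPABILITY = (6, 0, 0)


def parse_version(text, width=3):
    """Turn '450.80.02' into (450, 80, 2) so versions compare numerically.

    Short versions such as '11' are padded with zeros to `width` parts.
    """
    parts = [int(p) for p in re.findall(r"\d+", text)][:width]
    return tuple(parts + [0] * (width - len(parts)))


def meets_requirements(cuda, driver, compute_capability):
    """Return a dict saying which of the three minimums are satisfied."""
    return {
        "cuda": parse_version(cuda) >= MIN_CUDA,
        "driver": parse_version(driver) >= MIN_DRIVER,
        "compute_capability": (
            parse_version(compute_capability) >= MIN_COMPUTE_CAPABILITY
        ),
    }


# Hypothetical values for a machine with CUDA 11.2 and a Turing-class GPU.
report = meets_requirements("11.2", "470.57.02", "7.5")
print(report)
```

Comparing integer tuples rather than raw strings avoids mistakes such as treating "9.2" as newer than "11.0".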
Conda
cuDF can be installed with conda (miniconda, or the full Anaconda distribution) from the rapidsai channel:
For cudf version == 22.06:
# for CUDA 11.0
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
cudf=22.06 python=3.9 cudatoolkit=11.0
# or, for CUDA 11.2
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
cudf=22.06 python=3.9 cudatoolkit=11.2
For the nightly version of cudf:
# for CUDA 11.0
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge \
cudf python=3.9 cudatoolkit=11.0
# or, for CUDA 11.2
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge \
cudf python=3.9 cudatoolkit=11.2
Note: cuDF is supported only on Linux, and with Python versions 3.8 and later.
See the Get RAPIDS version picker for more OS and version info.
Build/Install from Source
See build instructions.
Contributing
Please see our guide for contributing to cuDF.
Contact
Find out more on the RAPIDS site.
Open GPU Data Science
The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
Apache Arrow on GPU
The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and conversion of data off the GPU, reducing compute time and cost for the high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, only a subset of the features in Apache Arrow is supported.