polars
Implement GPU support
Description
@stinodego closed the previous issue that mentioned GPU support and said there were no plans to support GPU in the future, but was open to a feature request. Seeing as the author of the previous issue didn't open a new feature request, I'm doing so instead.
We're using Pandas for a data processing pipeline and there were long-standing plans to migrate to Polars as our data volume is expected to increase in Q2-Q3 of 2024. Seeing as CuDF is available, it's unlikely that we'll go ahead with those original plans.
I'm still a proponent of Polars, as there are obvious improvements over Pandas. However, the migration doesn't make sense now that RapidsAI offers comparable or faster speeds with no code change, unless Polars also offers GPU support.
In the long run, it's difficult to envisage any widely used data processing library that does not offer GPU-accelerated compute. Namely, if the speed problem can be solved by buying relatively inexpensive consumer-grade hardware, spending developers' time (considerably more expensive) on rewriting code becomes a less attractive option. Hence this feature request.
Right now, to get the performance that's mentioned in the Rapids cuDF blog post, you seem to require the Nvidia Grace Hopper chip, which costs above USD 10,000 (IIRC). See this link.
So unless you're okay with that added upfront cost, I think the migration from Pandas to Polars would be the more economically feasible option.
Planning 10 years ahead is great, and I'm pretty sure nobody can really predict with high accuracy what the state of affairs will be then, but it doesn't seem prudent to wait around until then to upgrade your current Pandas-based pipeline.
I will repeat my reply to the previous issue here:
We do not have concrete plans to implement ways to leverage the GPU in the near future.
For some context, you can see Ritchie's answer to this question during the EuroSciPy 2023 conference: https://youtu.be/GTVm3QyJ-3I?t=3199
I will keep this issue open for further discussion.
Right now, to get the performance that's mentioned in the Rapids cuDF blog post, you seem to require the Nvidia Grace Hopper chip, which costs above USD 10,000 (IIRC). See this link.
CuDF runs on the RTX series as well, which is consumer grade. There are benchmarks online that show a ~60x speedup using an RTX 4080 over the same pandas operations running on an Intel i9. You should benchmark it against Polars if you have an Nvidia GPU -- results are comparable.
Planning 10 years ahead is great, and I'm pretty sure nobody can really predict with high accuracy what the state of affairs will be then, but it doesn't seem prudent to wait around until then to upgrade your current Pandas-based pipeline.
Thanks, for our uses we're going to switch to CuDF in late Q1 of next year. FWIW, this feature request is not for my benefit, as there's no chance GPU support will be rolled out in Polars by the time we need it, and even if it is, there's no chance I can convince my team to do a rewrite to get performance similar to switching to a GPU cluster.
There are benchmarks online that show a ~60x speedup using an RTX 4080 over the same pandas operations running on an Intel i9
could you share this please? just tried searching but didn't find it
Looks like it's a 30x speedup instead of 60x, and it's a 4070 Ti, not a 4080. For some reason I remembered CuDF taking 1 second instead of 2: https://youtu.be/9KsJRyZJ0vo?si=d3yPHQZbMkiBWylP&t=1053
At any rate, there are lots of community benchmarks out there; you can google them as well as I can. Better yet, do your own benchmarks and post them in the README.
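For anyone who wants to reproduce this locally, here is a minimal sketch of such a benchmark, assuming a CUDA-capable GPU and a recent cuDF release that ships the cudf.pandas accelerator; the data size and column names are invented for illustration, not an official benchmark:

```python
# bench.py -- minimal group-by benchmark sketch.
# Run `python bench.py` for CPU pandas, or `python -m cudf.pandas bench.py`
# to let cuDF transparently accelerate the unchanged pandas code on the GPU.
import time

import numpy as np
import pandas as pd
import polars as pl

N = 10_000_000  # made-up size; adjust to your RAM/VRAM
rng = np.random.default_rng(0)
data = {
    "key": rng.integers(0, 1_000, N),
    "value": rng.random(N),
}

# pandas path (CPU when run plainly, GPU when run via `-m cudf.pandas`).
pdf = pd.DataFrame(data)
t0 = time.perf_counter()
pdf.groupby("key")["value"].mean()
print(f"pandas groupby : {time.perf_counter() - t0:.3f}s")

# Polars path (multithreaded CPU engine).
pldf = pl.DataFrame(data)
t0 = time.perf_counter()
pldf.group_by("key").agg(pl.col("value").mean())
print(f"polars group_by: {time.perf_counter() - t0:.3f}s")
```

Running the same file with and without `-m cudf.pandas` gives the pandas CPU vs GPU comparison without touching the code, which is the "no code change" point being made above.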
GPU support might come. It seems a perfect project for a certain provider of hardware-accelerated products. But that's also what it is: a project in itself. The Polars project itself will not work on GPU acceleration, but will do its best to enable it.
I'm still a proponent of Polars, as there are obvious improvements. However, the migration doesn't make sense now that RapidsAI offers comparable or faster speeds with no code change, unless Polars also offers GPU support.
We'd like to argue that there are benefits IN the code change. ;)
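To make that trade-off concrete, here is a rough sketch of the kind of rewrite the migration implies; the file and column names (trades.csv, symbol, price, volume) are made up for illustration:

```python
import pandas as pd
import polars as pl

# pandas: eager, executes each step immediately
pdf = pd.read_csv("trades.csv")
out_pd = (
    pdf[pdf["price"] > 0]
    .groupby("symbol", as_index=False)["volume"]
    .sum()
)

# Polars: one lazy query the engine can optimize as a whole
# (predicate/projection pushdown, parallel execution)
out_pl = (
    pl.scan_csv("trades.csv")
    .filter(pl.col("price") > 0)
    .group_by("symbol")
    .agg(pl.col("volume").sum())
    .collect()
)
```

The Polars version is a single declarative query plan rather than a sequence of eager steps, which is a large part of the "benefit in the code change" argument.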
On consumer-grade GPUs, VRAM is a major constraint, and it becomes a real problem once the data is large.
In professional settings it is more appropriate to use a DSL. In quantitative research, the WorldQuant Alpha101 formulas are a classic example. Some of them look like this:
Alpha#002 (-1 * correlation(rank(delta(log(volume), 2)), rank(((close - open) / open)), 6))
Alpha#028 scale(((correlation(adv20, low, 5) + ((high + low) / 2)) - close))
Alpha#101 ((close - open) / ((high - low) + .001))
Formulas like these are convenient for quickly verifying ideas, but they need tooling to translate them into Polars or Pandas code, so I developed a translation tool: https://github.com/wukan1986/expr_codegen
You can refer to it to implement a cuDF version, or use cudf.pandas directly. You can try it online: https://exprcodegen.streamlit.app/
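As an illustration of the kind of target such a tool generates, here is a hand-written sketch of Alpha#101 as a Polars expression (my own translation for illustration, not output from expr_codegen):

```python
import polars as pl

# Alpha#101: (close - open) / ((high - low) + .001)
alpha_101 = (
    (pl.col("close") - pl.col("open"))
    / ((pl.col("high") - pl.col("low")) + 0.001)
).alias("alpha_101")

# Tiny made-up OHLC frame to show the expression in use.
df = pl.DataFrame({
    "open": [1.0, 2.0],
    "high": [1.5, 2.5],
    "low": [0.9, 1.8],
    "close": [1.2, 2.4],
})
print(df.with_columns(alpha_101))
```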
Sorry, my English is poor, so the comments and documents are in Chinese. You can use the translation tool provided by your browser to read them.
Apple silicon changes the game here. The unified memory model is how folks can run large AI models on consumer hardware for inference. Check out the Metal framework. I feel like any query engine that isn't leveraging that power will be obsolete in 2025/26…
There's a paper I read where the authors (researchers at Apple) demonstrated an alternative approach that uses flash storage rather than RAM, in the context of LLMs: https://arxiv.org/abs/2312.11514
Recent relevant blog post: https://pola.rs/posts/polars-on-gpu/
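For context, that post describes an opt-in GPU engine for the lazy API. Usage looks roughly like the sketch below (hedged: assuming Polars installed with its GPU extra and an NVIDIA GPU; the file and column names are invented, and the post/docs are authoritative for the exact current API):

```python
import polars as pl

# Build a lazy query as usual.
lf = (
    pl.scan_csv("trades.csv")
    .group_by("symbol")
    .agg(pl.col("volume").sum())
)

# Opt in to the GPU engine at collect time; queries the GPU engine can't
# handle are expected to fall back to the default CPU engine.
df = lf.collect(engine="gpu")
```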