
Implement GPU support

Open u3Izx9ql7vW4 opened this issue 1 year ago • 9 comments

Description

@stinodego closed the previous issue that mentioned GPU support and said there were no plans to support GPUs in the future, but was open to a feature request. Since the author of the previous issue didn't open a new one, I'm doing so instead.

We're using Pandas for a data processing pipeline, and there were long-standing plans to migrate to Polars since our data volume is expected to increase in Q2-Q3 of 2024. Now that cuDF is available, it's unlikely that we'll go ahead with those original plans.

I'm still a proponent of Polars, as there are obvious improvements over Pandas. However, the migration doesn't make sense now that RAPIDS offers comparable or faster speeds with no code change, unless Polars also offers GPU support.

In the long run, it's difficult to envisage any widely used data processing library that does not offer GPU-accelerated compute. If the speed problem can be solved by buying relatively inexpensive consumer-grade hardware, then spending developers' time (considerably more expensive) on rewriting code becomes the less attractive option. Hence this feature request.
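For reference, the "no code change" path here is cuDF's pandas accelerator mode, which proxies the pandas API onto the GPU. A minimal sketch of how an unmodified pipeline runs under it (the script name `my_pipeline.py` is a placeholder; a CUDA-capable GPU with RAPIDS installed is assumed):

```python
# cudf.pandas accelerator mode: run unmodified pandas code on the GPU.
# From the command line (script name is a placeholder):
#   python -m cudf.pandas my_pipeline.py
# Or in a Jupyter notebook, load the extension *before* importing pandas:
#   %load_ext cudf.pandas

import pandas as pd  # under the accelerator, this import is proxied to cuDF;
                     # unsupported operations fall back to CPU pandas

df = pd.DataFrame({"key": [1, 2, 1, 2], "val": [0.5, 1.5, 2.5, 3.5]})
print(df.groupby("key")["val"].mean())  # runs on the GPU when proxied
```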

u3Izx9ql7vW4 avatar Dec 19 '23 01:12 u3Izx9ql7vW4

Right now, to get the performance that's mentioned in the Rapids cuDF blog post, you seem to require the Nvidia Grace Hopper chip, which costs above USD 10,000 (IIRC). See this link.

So unless you're okay with that added upfront cost, I think migrating from Pandas to Polars would be the more economically feasible option.

Planning 10 years ahead is great, and I'm pretty sure nobody can really predict with high accuracy what the state of affairs will be then, but it doesn't seem prudent to wait around until then to upgrade your current Pandas-based pipeline.

avimallu avatar Dec 19 '23 06:12 avimallu

I will repeat my reply to the previous issue here:

We do not have concrete plans to implement ways to leverage the GPU in the near future.

For some context, you can see Ritchie's answer to this question during the EuroSciPy 2023 conference: https://youtu.be/GTVm3QyJ-3I?t=3199

I will keep this issue open for further discussion.

stinodego avatar Dec 19 '23 10:12 stinodego

Right now, to get the performance that's mentioned in the Rapids cuDF blog post, you seem to require the Nvidia Grace Hopper chip, which costs above USD 10,000 (IIRC). See this link.

cuDF runs on the RTX series as well, which is consumer grade. There are benchmarks online that show a ~60x speedup using an RTX 4080 over the same Pandas operations running on an Intel i9. You should benchmark it against Polars if you have an Nvidia GPU -- results are comparable.

Planning 10 years ahead is great, and I'm pretty sure nobody can really predict with high accuracy what the state of affairs will be then, but it doesn't seem prudent to wait around until then to upgrade your current Pandas-based pipeline.

Thanks, for our use case we're going to switch to cuDF in late Q1 of next year. FWIW, this feature request is not for my benefit, as there's no chance GPU support will be rolled out in Polars by the time we need it, and even if it were, there is no chance I could convince my team to do a rewrite to get performance similar to what switching to a GPU cluster gives us.

u3Izx9ql7vW4 avatar Dec 19 '23 17:12 u3Izx9ql7vW4

There are benchmarks online that show a ~60x speedup using an RTX 4080 over the same Pandas operations running on an Intel i9

could you share this please? just tried searching but didn't find it

MarcoGorelli avatar Dec 19 '23 17:12 MarcoGorelli

There are benchmarks online that show a ~60x speedup using an RTX 4080 over the same Pandas operations running on an Intel i9

could you share this please? just tried searching but didn't find it

Looks like it's a 30x speedup instead of 60x, and it's a 4070 Ti, not a 4080. For some reason I remembered cuDF taking 1 second instead of 2: https://youtu.be/9KsJRyZJ0vo?si=d3yPHQZbMkiBWylP&t=1053

At any rate, there are lots of community benchmarks out there; you can google them as well as I can. Better yet, do your own benchmarks and post them in the README.
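For anyone who wants to try this locally, here is a minimal sketch of such a head-to-head benchmark (timings depend entirely on hardware; the cuDF portion assumes RAPIDS is installed with a CUDA-capable GPU):

```python
import time
import numpy as np
import pandas as pd
import polars as pl

# Synthetic workload: a grouped aggregation over 10M rows.
n = 10_000_000
rng = np.random.default_rng(0)
data = {"key": rng.integers(0, 1_000, n), "val": rng.random(n)}

pdf = pd.DataFrame(data)
t0 = time.perf_counter()
pdf.groupby("key")["val"].mean()
print(f"pandas: {time.perf_counter() - t0:.3f}s")

pldf = pl.DataFrame(data)
t0 = time.perf_counter()
pldf.group_by("key").agg(pl.col("val").mean())
print(f"polars: {time.perf_counter() - t0:.3f}s")

# The same workload on the GPU, if cuDF is available:
# import cudf
# cdf = cudf.DataFrame(data)
# t0 = time.perf_counter()
# cdf.groupby("key")["val"].mean()
# print(f"cudf:   {time.perf_counter() - t0:.3f}s")
```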

u3Izx9ql7vW4 avatar Dec 19 '23 17:12 u3Izx9ql7vW4

GPU support might come. It seems a perfect project for a certain provider of hardware-accelerated products. But that's also what it is: a project in itself. The Polars project itself will not work on GPU acceleration, but will do its best to enable it.

I'm still a proponent of Polars, as there are obvious improvements. However, the migration doesn't make sense now that RAPIDS offers comparable or faster speeds with no code change, unless Polars also offers GPU support.

We'd like to argue that there are benefits IN the code change. ;)

ritchie46 avatar Dec 19 '23 19:12 ritchie46

On consumer-grade GPUs, VRAM is a big constraint, and it becomes a real problem when the data is large.

In specialized fields, it is often more appropriate to use a DSL. In quantitative research, the WorldQuant Alpha101 formulas are a classic example. Some examples are as follows:

Alpha#002   (-1 * correlation(rank(delta(log(volume), 2)), rank(((close - open) / open)), 6))
Alpha#028    scale(((correlation(adv20, low, 5) + ((high + low) / 2)) - close))
Alpha#101   ((close - open) / ((high - low) + .001))

Formulas like these are good for quickly verifying ideas, but they require tooling to translate them into Polars or Pandas code, so I developed a translation tool.

See https://github.com/wukan1986/expr_codegen. You can refer to it to implement a cuDF version, or use cudf.pandas directly.

You can try it online at https://exprcodegen.streamlit.app/
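To make the translation concrete: Alpha#101 maps almost directly onto a Polars expression. A minimal hand-written sketch (illustrative only, not the output of expr_codegen):

```python
import polars as pl

# Alpha#101: ((close - open) / ((high - low) + .001))
df = pl.DataFrame({
    "open":  [10.0, 11.0, 12.0],
    "high":  [10.5, 11.8, 12.4],
    "low":   [9.8, 10.9, 11.7],
    "close": [10.2, 11.5, 12.1],
})

alpha_101 = (
    (pl.col("close") - pl.col("open"))
    / ((pl.col("high") - pl.col("low")) + 0.001)
).alias("alpha_101")

print(df.with_columns(alpha_101))
```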

Sorry, my English is poor, so the comments and documents are in Chinese. You can use the translation tool provided by your browser to read them.

wukan1986 avatar Dec 20 '23 02:12 wukan1986

Apple silicon changes the game here. The unified memory model is how folks can run large AI models on consumer hardware for inference. Check out the Metal framework. I feel like any query engine that isn't leveraging that power will be obsolete in 2025/26…

rupurt avatar Jan 07 '24 04:01 rupurt

On consumer-grade GPUs, VRAM is a big constraint, and it becomes a real problem when the data is large.


There's a paper I read where researchers at Apple demonstrated an alternative approach that uses flash storage rather than RAM, in the context of LLMs: https://arxiv.org/abs/2312.11514

feynon avatar Jan 08 '24 06:01 feynon

Recent relevant blog post: https://pola.rs/posts/polars-on-gpu/

owenprough-sift avatar Apr 05 '24 13:04 owenprough-sift