[FEA] Offer more control over CPU fallback in cudf.pandas

Open bdice opened this issue 1 year ago • 7 comments

Is your feature request related to a problem? Please describe.

The default execution model for cudf.pandas is to try to execute an operation on the GPU, then fall back to the CPU if it fails for any reason. This approach is desirable for end users because it maximizes the number of cases where cudf.pandas "just works", but it makes it difficult to analyze when failures are occurring and why. The former can be addressed by running under the profiler, but that is more cumbersome than we would like in the many cases where we would rather get a quick signal in the form of a failure (e.g. when running a workflow or a test suite to analyze unsupported cases). Furthermore, there is no easy way to determine whether cudf and pandas return the same results for a given operation, which is a different failure mode that currently cannot be captured.

Describe the solution you'd like

We should generalize _fast_slow_function_call to support a wider range of fallback options. These options could be configured through an environment variable or through some global configuration option (the former is probably fine to start with). The different behaviors we would want to support are:

  • Error on fallback. We could then run the pandas test suite with this turned on and get a sense of how many tests cudf passes on its own.
  • Error on specific types of fallback. This would allow us to analyze the types of fallback that are occurring. Some of the most obvious error modes I can foresee (there are certainly others) are:
    • Out-of-memory errors, for the sake of planning No-OOM-related work
    • AttributeErrors for missing functionality
    • TypeErrors for differing function signatures
  • Error when cudf and pandas produce different outputs. This would be an extra branch within the fast path where the slow path is run even if the fast path succeeds, and then the fast and slow paths are compared for equivalence.

We may want to support warning instead of raising errors in some cases, but I don't think that's critical to start.

Describe alternatives you've considered

This could be configured by the cudf.pandas profiler, or a similar context manager?

Additional context

Feedback from @ianozsvald and @lmeyerov would be welcome!

bdice avatar Feb 06 '24 17:02 bdice

A Python Warning object would make sense, so that we can handle it programmatically.
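
As a rough illustration of what managed handling could look like, assuming a dedicated warning category existed: `FallbackWarning` and `run_with_notice` below are hypothetical names, not part of cudf or cudf.pandas. Only the standard-library `warnings` machinery is real.

```python
import warnings


class FallbackWarning(UserWarning):
    """Hypothetical category meaning 'this operation fell back to the CPU'."""


def run_with_notice(fast, slow, *args):
    """Try the fast path; warn with a dedicated category before falling back."""
    try:
        return fast(*args)
    except Exception as err:
        warnings.warn(
            f"falling back to CPU: {type(err).__name__}: {err}",
            FallbackWarning,
            stacklevel=2,
        )
        return slow(*args)


def _unsupported(x):
    # Stand-in for an operation with no GPU implementation.
    raise NotImplementedError("no GPU implementation")


# Callers can collect the warnings for later inspection...
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", FallbackWarning)
    result = run_with_notice(_unsupported, abs, -3)

# ...or promote the category to an error to get a hard failure.
with warnings.catch_warnings():
    warnings.simplefilter("error", FallbackWarning)
    try:
        run_with_notice(_unsupported, abs, -3)
        escalated = False
    except FallbackWarning:
        escalated = True
```

Because the category is an ordinary class, the same filters could be applied from the command line or a test configuration without changing the calling code.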

Note that we use cudf directly rather than cudf.pandas, so we would be interested in seeing the same thing there.

lmeyerov avatar Feb 07 '24 05:02 lmeyerov

@lmeyerov cudf doesn't fall back to CPU so you'd never see this with normal cudf usage. Only cudf.pandas has CPU fallback behavior. Can you clarify what you mean?

bdice avatar Feb 07 '24 18:02 bdice

Re: cudf, for some reason I thought a few cudf methods fall back to the CPU, such as in parsing, rather than raising NotImplementedError or emitting a warning.

Separately, and more broadly, there are some perf gotchas in cudf, such as places where it makes copies or sorts that good code would avoid. A perf-tips flag or mode that warns in these cases would be helpful for us, not just for the CPU fallback case. But that is a bigger story.

lmeyerov avatar Feb 07 '24 21:02 lmeyerov

Good feedback! There are a few cases in I/O where cudf does not offer a GPU-accelerated reader/writer for every format. That's the only exception I can think of right now where cudf executes CPU-only code (it copies to device and returns a GPU dataframe at the end). Those are documented in the notes on this page: https://docs.rapids.ai/api/cudf/stable/user_guide/io/io/

I can think of a few algorithms where cudf has cut down on extraneous copies/sorting over the last few releases (like drop_duplicates). If any specific cases come to mind, please file issues for those! We're aiming to reduce intermediate memory usage in cudf and these would likely align with that goal (in addition to improving performance).

bdice avatar Feb 07 '24 21:02 bdice

Yes, my broader point is that a perf-warnings mode would be very helpful, e.g. for cases where the defaults are slow for conformance reasons and a special calling pattern would be faster :)

lmeyerov avatar Feb 07 '24 22:02 lmeyerov

  • Error when cudf and pandas produce different outputs. This would be an extra branch within the fast path where the slow path is run even if the fast path succeeds, and then the fast and slow paths are compared for equivalence.

If it's okay with you @mroeschke, can I still work on this component, since it covers the issue I opened (https://github.com/rapidsai/cudf/issues/15817)?

Matt711 avatar May 22 '24 18:05 Matt711

Yes, go for it @Matt711!

mroeschke avatar May 22 '24 18:05 mroeschke

We could have two debugging mode options (note: we can use different names):

  1. mode.pandas_debugging
  2. mode.fallback_debugging

(1.) is for when fallback does not occur. It checks that the results from cudf and pandas agree and emits a warning if they do not. I'm working on that option in PR #15837.

(2.) is for when fallback does occur. It could raise errors on the specific types of fallback mentioned:

  • Out-of-memory errors, for the sake of planning No-OOM-related work
  • AttributeErrors for missing functionality
  • TypeErrors for differing function signatures
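
Option (1.) could be sketched along these lines. This is only an illustration of the idea, not the code in the PR: `call_and_compare`, `ResultMismatchWarning`, and the `equal` parameter are hypothetical names, and plain `==` stands in for a real DataFrame comparison.

```python
import warnings


class ResultMismatchWarning(UserWarning):
    """Hypothetical category for 'fast and slow results disagree'."""


def call_and_compare(fast, slow, *args, equal=lambda a, b: a == b):
    """Run the fast path, then re-run on the slow path and compare results."""
    try:
        fast_result = fast(*args)
    except Exception:
        # Ordinary fallback: there is no fast result to compare against.
        return slow(*args)

    # The fast path succeeded, so run the slow path purely as a check.
    slow_result = slow(*args)
    if not equal(fast_result, slow_result):
        warnings.warn(
            f"fast and slow results differ: {fast_result!r} != {slow_result!r}",
            ResultMismatchWarning,
            stacklevel=2,
        )
    return fast_result
```

For real DataFrames, `equal` would need to be something closer to a wrapper around `pandas.testing.assert_frame_equal`. Operations that mutate their arguments would also need their inputs copied before the second call, which this sketch ignores.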

What do we think about these two options?

cc. @bdice @vyasr @wence-

Matt711 avatar May 29 '24 17:05 Matt711

Making these modes independently configurable is definitely what we want, yes. As I commented in #15837, though, I don't think options are the right way to expose this. Options are user-facing, whereas what we're trying to accomplish here is something for developers. Some environment variables documented in the developer guide are probably closer to what I would envision, especially for the first one (pandas_debugging). I don't see a reason for a user to ever need that one. I could envision exposing some internal APIs to control the second case (fallback_debugging), because in that scenario it could be useful to have the profiler hook into these so that users could collect information on why fallback occurred.

vyasr avatar May 30 '24 01:05 vyasr

Using an environment variable instead of an option is fine with me. I am curious if you have a more specific place in mind in the Developer Guide for documenting the environment variable?

Matt711 avatar May 30 '24 13:05 Matt711

Maybe we can add a new section on the fast-slow-proxy wrapping scheme. It can be mostly stubbed out and we can add info.

wence- avatar May 30 '24 14:05 wence-

Yes, and I could add that in a new cudf.pandas section in the Developer Guide?

Matt711 avatar May 30 '24 14:05 Matt711

@Matt711 what's the status of this issue after #16562? I think the next step would be to work on enabling the various fallback modes suggested in the issue (which in turn would help us do more systematic analysis of fallback).

vyasr avatar Nov 05 '24 20:11 vyasr

Thanks for the reminder! I'll create a PR that raises on specific kinds of fallback, which I think should close this issue.

Matt711 avatar Nov 05 '24 21:11 Matt711