[FEA] Diversify libcudf's exceptions to simplify exception handling

Open vyasr opened this issue 2 years ago • 13 comments

Is your feature request related to a problem? Please describe. Currently the vast majority of exceptions thrown by libcudf are cudf::logic_error due to the use of the CUDF_EXPECTS and CUDF_FAIL macros. These are in large part used for input validation. Aside from cudf::logic_error, the only other exceptions that are thrown with any frequency are std::bad_alloc and cuda_error. While the exception payloads are typically sufficiently informative for direct users of the C++ API, they are less useful for users of higher-level language libraries that rely on libcudf. In particular, many cuDF Python functions rely on parsing the exception payloads to translate C++ exceptions into suitable exception types on the Python side.

Describe the solution you'd like It would be helpful for libcudf to make use of a wider range of standard exceptions to provide more informative exceptions for higher-level libraries to parse. From the Python perspective, it would be especially valuable if libcudf threw exceptions that matched the exceptions that Cython already knows how to translate into Python. While it's not clear that all of these would be useful, at least a few of them certainly could be. For example, a number of cases would be nicely handled if libcudf threw a std::invalid_argument rather than a cudf::logic_error since Cython will automatically translate the former into a Python ValueError. One way to accomplish this would be to allow CUDF_EXPECTS to accept a third parameter indicating the exception type to throw. An example of this already exists in rmm with the corresponding RMM_EXPECTS macro.
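
For illustration, here is a minimal sketch of what a three-parameter CUDF_EXPECTS could look like, modeled loosely on RMM_EXPECTS; the macro body below is an assumption for discussion purposes, not libcudf's actual definition:

#include <stdexcept>
#include <string>

// Hypothetical sketch: condition, message, and the exception type to throw on failure.
#define CUDF_EXPECTS(cond, reason, exception_type)                                    \
  do {                                                                                \
    if (!(cond)) { throw exception_type{std::string{"cuDF failure: "} + (reason)}; }  \
  } while (0)

// Example use: throws std::invalid_argument, which Cython translates to a Python ValueError.
// CUDF_EXPECTS(lhs.type() == rhs.type(), "Column type mismatch", std::invalid_argument);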

For cases where no existing mapping exists, it would be helpful if exception payloads were standardized in some way so that the messages could be parsed more easily. For instance, a very simple approach might be for CUDF_EXPECTS to stringify and embed the name of the exception into the message. It is also possible for the Python layer to introduce a custom exception handler, so we could do that if cudf introduces additional custom exceptions. I am open to systematically embedding additional information in the message as well if others think of useful ways to do so.

Describe alternatives you've considered libcudf could also use some sort of logging framework rather than relying entirely on exceptions. Such a framework would be significantly more work to implement, but it would have the benefit of also offering additional information. cuDF Python uses warnings in some non-failing cases to provide information to users, and a C++ logging framework could provide information that we could translate to Python warnings.

Additional context I'm not familiar with how our JNI code currently handles this or whether there are any utilities to simplify translation of C++ exceptions into Java. Ideally libcudf would adopt an exception handling pattern that is suitable for use with all of the higher level libraries that rely on it, so it would be valuable to get some feedback from someone with more experience with cudf's Java interface.

vyasr avatar Feb 02 '22 23:02 vyasr

CC @jlowe @revans2 for a Java perspective. AFAICT everything in the JNI just uses a single CudfException class. I don't know if there's any automatic way to generate a wider set of Java exceptions from C++ exceptions through features of the Native Interface.

vyasr avatar Feb 02 '22 23:02 vyasr

One way to accomplish this would be to allow CUDF_EXPECTS to accept a third parameter indicating the exception type to throw.

Note that RMM's version already has this: https://github.com/rapidsai/rmm/blob/0515ca47bc6abdc4619a099976bdbf059cbde624/include/rmm/detail/error.hpp#L87-L120

From the Python perspective, it would be especially valuable if libcudf threw exceptions that matched the exceptions that Cython already knows how to translate into Python.

I totally support this so long as we have a precise list of the exact exceptions you'd want libcudf to throw and precisely when you'd want them to be thrown.

it would be helpful if exception payloads were standardized in some way so that the messages could be parsed more easily

I do not support this. The what() message is not the correct channel to carry programmatic information. I think the fact that we are currently parsing exception message strings is a fundamentally flawed approach and we need to get away from that ASAP as opposed to leaning further into it.

It is also possible for the Python layer to introduce a custom exception handler, so we could do that if cudf introduces additional custom exceptions.

This is the correct approach. If you want to know about a situation that doesn't correspond to one of the exceptions Cython already knows about, then libcudf should introduce a custom exception type and cudf should know to catch that type.
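
As a rough illustration of that approach (the class name below is hypothetical, not an existing libcudf type), a custom exception could derive from the closest standard exception so that default handling still works while the Python layer can register a more specific translation:

#include <stdexcept>
#include <string>

namespace cudf {

// Hypothetical example: a libcudf-specific exception for unsupported data types.
// Deriving from std::invalid_argument preserves Cython's default ValueError mapping,
// while a custom handler in cuDF Python could map it to a more specific error.
class data_type_error : public std::invalid_argument {
 public:
  explicit data_type_error(std::string const& what_arg) : std::invalid_argument{what_arg} {}
};

}  // namespace cudf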

jrhemstad avatar Feb 02 '22 23:02 jrhemstad

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Mar 05 '22 00:03 github-actions[bot]

I have been looking into both C++ and Python exceptions. Here is a brief summary of the investigation.

C++ Exceptions: https://en.cppreference.com/w/cpp/error/exception
Python Exceptions: https://docs.python.org/3/library/exceptions.html#bltin-exceptions

There are 26 C++ exceptions and 42 Python exceptions (plus or minus, depending on how you count them). Before looking at the Cython mappings, I managed to line up 7 of their 11 mappings. I missed both ArithmeticErrors and both TypeErrors. The only error I "mapped" that Cython doesn't was the SystemError.

These are the recommended Cython mappings: [image: Cython's table of C++ to Python exception mappings]

And this is an "exploded" version to highlight the missing C++ exceptions:

C++ exception → Python exception
logic_error → (none)
invalid_argument → ValueError
domain_error → ValueError
length_error → (none)
out_of_range → IndexError
future_error (C++11) → (none)
bad_optional_access (C++17) → (none)
runtime_error → RuntimeError
range_error → ArithmeticError
overflow_error → OverflowError
underflow_error → ArithmeticError
regex_error (C++11) → (none)
system_error (C++11) → SystemError
ios_base::failure (C++11) → IOError
filesystem::filesystem_error (C++17) → (none)
nonexistent_local_time (C++20) → (none)
ambiguous_local_time (C++20) → (none)
format_error (C++20) → (none)
bad_typeid → TypeError
bad_cast → TypeError
bad_any_cast (C++17) → (none)
bad_weak_ptr (C++11) → (none)
bad_function_call (C++11) → (none)
bad_alloc → MemoryError
bad_array_new_length (C++11) → (none)
bad_exception → (none)
bad_variant_access (C++17) → (none)

Probably the place to start is just by tackling a single pairing, such as std::invalid_argument and ValueError.

codereport avatar Mar 28 '22 23:03 codereport

Probably the place to start is just by tackling a single pairing

I think the place to start is to find where C++ exception what() messages are currently being parsed from Python in order to raise the proper Python exception. That kind of exception string parsing is very bad and we should seek to eliminate it ASAP.

jrhemstad avatar Mar 29 '22 03:03 jrhemstad

One consideration that recently popped up on the Spark side is differentiating errors that are recoverable by reattempting the task vs. errors that are fatal to the process. For example, CUDA's cudaErrorIllegalAddress and some other CUDA errors will cause every subsequent CUDA API call to return an error and thus are effectively fatal to the process. It would be nice if there was a specific exception subclass/interface that could be caught for these type of exceptions.

Failing that, it would be nice to have an exception specific to CUDA errors that includes the CUDA error code so the application can discern these "fatal" errors when appropriate. Today CUDA errors are encoded in the exception string rather than accessible directly within the exception.
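
As a sketch of what that could look like (the class names below are illustrative, not existing libcudf types), the exception could store the cudaError_t directly, with a subclass reserved for fatal/sticky errors:

#include <cuda_runtime_api.h>
#include <stdexcept>
#include <string>

// Illustrative only: an exception that carries the CUDA error code so callers can
// inspect it programmatically instead of parsing the what() string.
class cuda_error_with_code : public std::runtime_error {
 public:
  cuda_error_with_code(std::string const& msg, cudaError_t code)
    : std::runtime_error{msg}, code_{code} {}
  cudaError_t error_code() const noexcept { return code_; }

 private:
  cudaError_t code_;
};

// Callers could catch this subclass specifically to treat sticky CUDA errors as fatal.
class fatal_cuda_error : public cuda_error_with_code {
 public:
  using cuda_error_with_code::cuda_error_with_code;
};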

jlowe avatar Mar 29 '22 13:03 jlowe

It would be nice if there was a specific exception subclass/interface that could be caught for these type of exceptions.

This would be nice, but it is harder than it sounds because CUDA doesn't really document all that well which errors are like this and I believe it can depend on the context in which they occurred.

it would be nice to have an exception specific to CUDA errors that includes the CUDA error code

Definitely. I could have sworn we already did that, but I guess not.

jrhemstad avatar Mar 29 '22 13:03 jrhemstad

This would be nice, but it is harder than it sounds because CUDA doesn't really document all that well which errors are like this and I believe it can depend on the context in which they occurred.

Hm, one way we could hack around this is to call cudaGetLastError twice and if the second invocation doesn't return cudaSuccess (and returns the same error as the first) then it is nearly certain that a sticky error occurred. I say "nearly" because it is possible that some other asynchronous error occurred between the two calls to cudaGetLastError.

A heavier hammer would be to do a cudaDeviceSynchronize() after the cudaGetLastError().

So like this:

#include <cuda_runtime_api.h>

// Note: sticky_error and cudart_error are placeholder exception types in this sketch.
void throw_on_cuda_error(cudaError_t error) {
  if (cudaSuccess != error) {
    cudaGetLastError();                    // Clear any lingering cudart error
    auto const last = cudaGetLastError();  // Check if a cudart error still exists (sticky)
    bool const is_sticky = (error == last) and
                           (last == cudaDeviceSynchronize());  // Sync to ensure "last" wasn't an async error
    if (is_sticky) { throw sticky_error{""}; }
    throw cudart_error{""};
  }
}

jrhemstad avatar Mar 29 '22 14:03 jrhemstad

Probably the place to start is just by tackling a single pairing

I think the place to start is to find where C++ exception what() messages are currently being parsed from Python in order to raise the proper Python exception. That kind of exception string parsing is very bad and we should seek to eliminate it ASAP.

My observations of this kind of parsing are why I suggested starting with the invalid_argument -> ValueError mapping. I suspect that most cases where we are currently doing this are where libcudf is doing a CUDF_FAIL on an input parameter, and that the most common case of those right now is mapping to a Python ValueError due to an invalid argument. I could also imagine there being some cases of OverflowError, IndexError, and/or TypeError popping up. A command like grep -C 5 -R "except RuntimeError" python/cudf/cudf --include="*.py" should give a reasonable starting point for finding places where we are catching a C++-generated RuntimeError and re-raising it as a different type of error.

vyasr avatar Mar 29 '22 16:03 vyasr

it would be nice to have an exception specific to CUDA errors that includes the CUDA error code

Definitely. I could have sworn we already did that, but I guess not.

We do have cuda_error, which is thrown by CUDA_TRY. It sticks the error code into the error message, though. Perhaps now would be a good time to address the TODO requesting that the error code be encoded in the exception itself?

vyasr avatar Mar 29 '22 16:03 vyasr

I filed https://github.com/rapidsai/cudf/issues/10553 separately as it is orthogonal to the rest of the exception improvements.

jrhemstad avatar Mar 31 '22 14:03 jrhemstad

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar May 11 '22 22:05 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Aug 09 '22 23:08 github-actions[bot]

Between #12076, #12078, and #12426 enough work has been done to consider this discussion closed. I will open a new issue to discuss follow-up work in a more task-oriented format to better measure completion.

vyasr avatar Mar 06 '23 18:03 vyasr