polars icon indicating copy to clipboard operation
polars copied to clipboard

Standardize error message format

Open stinodego opened this issue 1 year ago • 10 comments

There is a lot of error handling across the Polars code base. We can do a better job of standardizing the formatting of the resulting error messages.

I would like to adhere to the following formatting rules:

  • The error message starts with a single sentence, without capitalization or trailing period.
  • If further elaboration is required, two newlines are added before adding further text with proper capitalization and punctuation.
  • Code snippets like function or parameter names should be enclosed in backticks. Class names can remain without backticks.
  • When adding format strings, use the !r representation, e.g. f"Column name: {name!r}"
  • Language should be American English, e.g. "normalize" instead of "normalise" etc.

For background and explanation of these rules, please see this excellent comment by @aldanor

Examples

Single line

raise IndexError("list index out of range")
# IndexError: list index out of range

raise TypeError(f"expected an integer for the `dtype` parameter, got {type(dtype).__name__!r}")
# TypeError: expected an integer for the `dtype` parameter, got 'str'

Multi-line

raise ValueError(
    "time zone not supported for the `start` parameter"
    "\n\nTry removing the time zone info by using `.dt.replace_time_zone`."
)
# ValueError: time zone not supported for the `start` parameter
#
# Try removing the time zone info by using `.dt.replace_time_zone`.

Maybe we should have a utility function for raising errors like we do for issuing deprecation warnings.

stinodego avatar Aug 11 '23 07:08 stinodego

I might have a stab at this over the weekend; sounds like a job that could be made eminently tractable with some carefully chosen regular expressions 🤣

(Consistent well-formatted error messages are definitely A Good Thing™).

alexander-beedie avatar Aug 11 '23 07:08 alexander-beedie

I gave this a try, just one point about the example mentioned. Does it make any difference:

raise ValueError("Got {type(dtype).__name__!r}")

And:

raise ValueError("Got {type(dtype).__name__}")

Since I guess type(dtype).__name__ will be string anyway.

aminalaee avatar Aug 14 '23 12:08 aminalaee

Does it make any difference

Yes it does! Just try it.

>>> word = 'hi'
>>> f"Word: {word}"
'Word: hi'
>>> f"Word: {word!r}"
"Word: 'hi'"

stinodego avatar Aug 14 '23 12:08 stinodego

Is this issue still relevant? I would like to start contributing to the project, but am actively learning Rust.

TheDataScientistNL avatar Oct 08 '23 10:10 TheDataScientistNL

Is this issue still relevant? I would like to start contributing to the project, but am actively learning Rust.

Yes it is. There's plenty of error messages not conforming yet. You can have a look both on the Rust and the Python side to see if you find areas to improve. Start out small!

stinodego avatar Oct 08 '23 19:10 stinodego

Is this issue still relevant? I would like to start contributing to the project, but am actively learning Rust.

Yes it is. There's plenty of error messages not conforming yet. You can have a look both on the Rust and the Python side to see if you find areas to improve. Start out small!

OK thanks! Will start on it.

TheDataScientistNL avatar Oct 08 '23 19:10 TheDataScientistNL

@stinodego Similar approach on the rust side? Would be nice to explore that side whilst doing some of these chores.

TheDataScientistNL avatar Oct 11 '23 14:10 TheDataScientistNL

Yes!

stinodego avatar Oct 11 '23 14:10 stinodego

I want to point out another problem that I see with this effort: Many error messages of the type "TypeAlias must be one of ..." don't get their arguments from the appropriate enum or TypeAlias, but rather they just specify it, e.g.,

raise ValueError(
    f"`time_unit` must be one of {{'ns', 'us', 'ms'}}, got {time_unit!r}"
)

Moreover, this f-string might get repeated in the module several times. Compare this with,

valid_format_names = get_args(TableFormatNames)
if format not in valid_format_names:
    raise ValueError(
        f"invalid table format name: {format!r}\nExpected one of: {', '.join(valid_format_names)}"
    )

How are you guys feeling about moving the error-message formatting or even the ValueError creation itself into type_aliases.py? Or alternatively, into a designated validations/errors module?

rancomp avatar Oct 19 '23 11:10 rancomp

I want to point out another problem that I see with this effort: Many error messages of the type "TypeAlias must be one of ..." don't get their arguments from the appropriate enum or TypeAlias, but rather they just specify it, e.g.,

raise ValueError(
    f"`time_unit` must be one of {{'ns', 'us', 'ms'}}, got {time_unit!r}"
)

Moreover, this f-string might get repeated in the module several times. Compare this with,

valid_format_names = get_args(TableFormatNames)
if format not in valid_format_names:
    raise ValueError(
        f"invalid table format name: {format!r}\nExpected one of: {', '.join(valid_format_names)}"
    )

How are you guys feeling about moving the error-message formatting or even the ValueError creation itself into type_aliases.py? Or alternatively, into a designated validations/errors module?

I think you make a fair point, but I guess it is up to the code owners to make such a decision, not me. I would think, if accepted, it is better of in a different issue? The task at hand is already quite large in terms of standardization

TheDataScientistNL avatar Oct 19 '23 16:10 TheDataScientistNL