cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[ENH] Use `strict=True` argument to `zip` once py39 support is dropped

Open wence- opened this issue 1 year ago • 1 comments

In many places in the cudf code we zip two (or more) iterables together with the assumption/precondition that they are all of equal length. The pattern is (approximately):

names: list[str]
columns: list[Column]
data: dict[str, Column] = dict(zip(names, columns))

This has, by design of zip, a potential problem lurking in that if the two inputs are not of equal length, we only keep the first min(len(names), len(columns)) objects.

To avoid check this at user-input boundaries we need to be careful to check (and then raise if not) that the inputs we are zipping do have equal length. This is easy to forget.

Python 3.10 introduces a new keyword-argument to zip, strict, which can be used to assert that the inputs have the same length. We should consider using this across the code-base when no longer supporting Python 3.9.

wence- avatar May 23 '24 09:05 wence-

+1. Also there's a ruff rule that would allow us to enforce this https://docs.astral.sh/ruff/rules/zip-without-explicit-strict/

mroeschke avatar May 23 '24 15:05 mroeschke

https://github.com/rapidsai/cudf/pull/16637 partially addressed this. There remains more work to do since that PR only changed this in the cudf-polars package, whereas we still need to make this change throughout the rest of the repo.

vyasr avatar Aug 29 '24 00:08 vyasr