[FEA] Mark `DataFrame.dtypes` as `_external_only_api`

Open vyasr opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe.

`DataFrame.dtypes` is used in many places in the code. For pandas compatibility, this property constructs a `pd.Series` from the column dtypes. This construction adds overhead that internal callers do not need, especially because in many cases the output is immediately converted back to a list or a `{colname: dtype}` dict.

Describe the solution you'd like

`DataFrame.dtypes` should be decorated with `_external_only_api`. All internal usage should switch to `Frame._dtypes`, which simply constructs a dict and avoids the overhead. Here's a quick indication of the benefits:

In [1]: import cudf

In [2]: df = cudf.DataFrame({f"{i}": [i] for i in range(10)})

In [3]: %timeit df._dtypes
2.31 µs ± 15.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit df.dtypes
165 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [5]: df = cudf.DataFrame({f"{i}": [i] for i in range(100)})

In [6]: %timeit df._dtypes
13.9 µs ± 47.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit df.dtypes
316 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
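To make the proposal concrete, here is a minimal sketch of how an external-only guard could behave. It is not cudf's actual `_external_only_api` implementation: `external_only_api`, `ToyFrame`, and the package name `toypkg` are hypothetical, and a plain list stands in for the `pd.Series` that the real `dtypes` returns. The guard looks at the caller's module name and raises when the call comes from inside the package.

```python
import functools
import sys


def external_only_api(internal_package, alternative):
    """Hypothetical guard: reject calls that originate inside `internal_package`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Frame 1 is the Python code that invoked the wrapper.
            caller = sys._getframe(1).f_globals.get("__name__", "")
            is_internal = caller == internal_package or caller.startswith(
                internal_package + "."
            )
            if is_internal:
                raise RuntimeError(
                    f"{func.__name__} is external-only. {alternative}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


class ToyFrame:
    """Toy stand-in for a frame; dtype values are plain strings."""

    def __init__(self, dtypes):
        self._column_dtypes = dict(dtypes)

    @property
    def _dtypes(self):
        # Cheap internal accessor: just a {colname: dtype} dict.
        return dict(self._column_dtypes)

    @property
    @external_only_api("toypkg", "Use ToyFrame._dtypes instead.")
    def dtypes(self):
        # Stand-in for the more expensive pandas-compatible object.
        return list(self._column_dtypes.items())


if __name__ == "__main__":
    frame = ToyFrame({"a": "int64", "b": "float64"})
    print(frame.dtypes)  # called from user code, so it is allowed
    try:
        # Simulate a call made from a module inside the hypothetical package.
        exec("frame.dtypes", {"__name__": "toypkg.core", "frame": frame})
    except RuntimeError as err:
        print(err)
```

A guard of this kind turns accidental internal use of the slow path into an immediate error, which makes the remaining call sites easy to find and migrate.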

Describe alternatives you've considered

None

Additional context

If any internal functionality actually relies on the output of `dtypes` being a `Series`, we should carefully consider whether that functionality should be reimplemented. There is almost no reason to prefer a `Series` over a dict internally.
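As a rough guide for auditing such call sites, the sketch below shows how common `Series`-style accesses map onto a plain dict. The `dtypes` mapping and the helper `columns_with_dtype` are hypothetical, and the dtype values are placeholder strings rather than real cudf dtype objects. One difference worth checking during migration: iterating a pandas `Series` yields its values, while iterating a dict yields its keys.

```python
# Stand-in for what a {colname: dtype} accessor would return.
dtypes = {"a": "int64", "b": "float64", "c": "int64"}

# Lookup by column name reads the same on a Series and on a dict.
dtype_of_b = dtypes["b"]

# Series.items() and dict.items() both yield (name, dtype) pairs.
pairs = list(dtypes.items())

# Iterating the dict gives column names (a Series would give dtypes),
# so code that looped over the Series should loop over .values() instead.
column_names = list(dtypes)
column_dtypes = list(dtypes.values())


def columns_with_dtype(dtypes, wanted):
    """Hypothetical helper: names of the columns whose dtype equals `wanted`."""
    return [name for name, dtype in dtypes.items() if dtype == wanted]


int_columns = columns_with_dtype(dtypes, "int64")
```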

vyasr • Aug 03 '22 23:08