Tom Augspurger comments

Results: 1078 comments of Tom Augspurger

pd.crosstab, categorical data and missing instances

Yeah, it'd be great if you can take a shot. But first, let's see if @jreback and @jorisvandenbossche agree that the documented version is correct, and that treating `dropna` differently...

pd.crosstab, categorical data and missing instances

In this case, crosstab is a `pivot_table` followed up by a `fillna(0)`:

```python
table = df.pivot_table('__dummy__', index=rownames, columns=colnames,
                       aggfunc=len, margins=margins, dropna=dropna)
table = table.fillna(0).astype(np.int64)
```

so the change to `pivot_table`'s...
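
To make the "`pivot_table` followed by `fillna(0)`" description concrete, here is a minimal sketch on a toy frame. The data and the column names `a` and `b` are made up for illustration; only the `__dummy__` column name and the two-step pattern come from the snippet above, and this is not the actual pandas source.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the combination ("y", "v") never occurs.
df = pd.DataFrame({"a": ["x", "x", "y"], "b": ["u", "v", "u"]})
df["__dummy__"] = 0  # placeholder column whose values are only counted

# Step 1: count rows per (a, b) pair; unobserved pairs come out as NaN.
table = df.pivot_table("__dummy__", index="a", columns="b", aggfunc=len)

# Step 2: turn the NaN holes into zero counts and restore an integer dtype.
table = table.fillna(0).astype(np.int64)

# The public function should give the same counts on this input.
ct = pd.crosstab(df["a"], df["b"])
print(table)
print(ct)
```

The unobserved `("y", "v")` cell is the interesting one: it only becomes `0` because of the `fillna(0)` step, so whatever `pivot_table` does with missing combinations feeds directly into the `crosstab` result.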

pd.crosstab, categorical data and missing instances

> so how is it possible (apart from using categoricals) that a value is not present? I guess it can happen if you have multiple levels, some of which aren't...
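
For the categorical case mentioned in the quote, a small sketch of how a value can be declared but never present. The data is invented for illustration and says nothing about how `crosstab` itself should treat such values.

```python
import pandas as pd

# Category "c" is declared but never observed in the data (hypothetical example).
s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

# value_counts on a categorical Series reports every declared category,
# including those with a count of zero.
counts = s.value_counts()
unobserved = [cat for cat in s.cat.categories if counts[cat] == 0]
print(counts)
print(unobserved)
```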

Configurable blocksize mode for streaming executor in unit tests

The bad news: 196 tests fail with this. I'm starting to work through those in the issues linked from https://github.com/rapidsai/cudf/issues/18928. Fixes for those will probably go in separate PRs. But...

Configurable blocksize mode for streaming executor in unit tests

Quick status update here: We have two more PRs in the works that fix the last two correctness issues:

- https://github.com/rapidsai/cudf/pull/19196
- https://github.com/rapidsai/cudf/pull/19187

And then I'll push an update here...

Configurable blocksize mode for streaming executor in unit tests

I noticed that there's an `assert_sink_result_equal`, so I've added the same `blocksize_mode` argument in https://github.com/rapidsai/cudf/pull/19146/commits/38fb2cbdb7043deea73f35d72b3663d38266bbe5. That required adjusting the assert function a bit (`scan_csv` won't automatically scan a folder named...

Configurable blocksize mode for streaming executor in unit tests

/merge

[BUG]: NA values incorrectly filled with `False` in String ops with streaming executor and multiple partitions

Here are some failing tests.

```diff
diff --git a/python/cudf_polars/tests/test_scan.py b/python/cudf_polars/tests/test_scan.py
index 922321830b..d27f8f96dd 100644
--- a/python/cudf_polars/tests/test_scan.py
+++ b/python/cudf_polars/tests/test_scan.py
@@ -462,3 +462,21 @@ def test_scan_csv_without_header_and_new_column_names_raises(df, tmp_path):
     make_partitioned_source(df, path, "csv", write_kwargs={"include_header": False})
     q...
```

[BUG]: NA values incorrectly filled with `False` in String ops with streaming executor and multiple partitions

That still fails:

```
================ FAILURES ================
____________ test_string_na_na ____________
E AssertionError: Series are different (exact value mismatch)
[left]: ['a', 'a', None, 'a', 'b', None, 'b', 'b']
[right]: ['a', 'a',...
```

[BUG]: NA values incorrectly filled with `False` in String ops with streaming executor and multiple partitions

Simplifying the reproducer a bit:

```python
import cudf_polars.containers
import polars as pl
import polars.testing.asserts

df = pl.DataFrame({"a": ["a", None]})

# this is buggy
result = cudf_polars.containers.DataFrame.from_polars(df.tail(1)).to_polars()
expected = df.tail(1)
polars.testing.asserts.assert_frame_equal(result,...
```