beavis
Pandas helper functions
Pandas and Dask helper methods.
See this blog post for a full description of how to use Beavis to test your code.
Install
Install with pip: pip install beavis
Test helpers
These test helper methods are meant to be used in test suites. They provide descriptive error messages to allow for a seamless development workflow.
The test helpers are inspired by chispa and spark-fast-tests, popular test helper libraries for the Spark ecosystem.
There are built-in Pandas testing methods that can also be used, but they don't provide error messages that are as easy to parse. The following sections compare the built-in Pandas output and what's output by Beavis, so you can choose for yourself.
Column comparisons
The built-in assert_series_equal method does not make it easy to see which rows are equal and which are different, so quickly fixing your tests and staying in flow is hard.
Here's the built-in error message when comparing series that are not equal.
import pandas as pd
df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
pd.testing.assert_series_equal(df["col1"], df["col2"])
> ???
E AssertionError: Series are different
E
E Series values are different (50.0 %)
E [index]: [0, 1, 2, 3]
E [left]: [1042, 2, 9, 6]
E [right]: [5, 2, 7, 6]
Here's the beavis error message that aligns rows and highlights the mismatches in red.
import beavis
beavis.assert_pd_column_equality(df, "col1", "col2")
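To make the idea of a row-aligned comparison concrete, here is a minimal, library-independent sketch. It is not how beavis is implemented: the function name format_aligned_diff and the "x" marker are hypothetical, chosen only for illustration, and plain lists stand in for the two columns.

```python
def format_aligned_diff(left, right, left_name="col1", right_name="col2"):
    """Return one text line per row, prefixing mismatched rows with 'x'.

    Hypothetical helper for illustration only; not part of beavis.
    """
    if len(left) != len(right):
        raise ValueError("columns must have the same length")
    # Pad the left column so the separator lines up on every row.
    width = max(len(str(value)) for value in [left_name, *left])
    lines = [f"  {left_name:<{width}} | {right_name}"]
    for a, b in zip(left, right):
        marker = " " if a == b else "x"
        lines.append(f"{marker} {str(a):<{width}} | {b}")
    return lines


# Same values as col1 and col2 in the example above.
report = format_aligned_diff([1042, 2, 9, 6], [5, 2, 7, 6])
print("\n".join(report))
```

Each row appears on its own line, so the two mismatched rows (the first and third) stand out immediately instead of having to be picked out of two flat lists.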
You can also compare columns in a Dask DataFrame.
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=2)
beavis.assert_dd_column_equality(ddf, "col1", "col2")
The assert_dd_column_equality error message is similarly descriptive.
DataFrame comparisons
The built-in pandas.testing.assert_frame_equal method doesn't output an error message that's easy to understand. See this example.
df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
pd.testing.assert_frame_equal(df1, df2)
E AssertionError: DataFrame.iloc[:, 0] (column name="col1") are different
E
E DataFrame.iloc[:, 0] (column name="col1") values are different (50.0 %)
E [index]: [0, 1]
E [left]: [1, 2]
E [right]: [5, 2]
beavis provides a nicer error message.
beavis.assert_pd_equality(df1, df2)
DataFrame comparison options:
- check_index (default True)
- check_dtype (default True)
Let's convert the Pandas DataFrames to Dask DataFrames and use the assert_dd_equality function to check they're equal.
ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
beavis.assert_dd_equality(ddf1, ddf2)
These DataFrames aren't equal, so we'll get a good error message that's easy to debug.
If the dtypes aren't equal, assert_dd_equality will give you another error message that's also easy to understand.
Dask helpers
Create a Dask DataFrame:
import beavis
ddf = beavis.create_ddf(
[[1, "a"], [2, "b"], [3, "c"], [4, "d"]], ["nums", "letters"]
)
Print all the partitions in the DataFrame.
beavis.print_partitions(ddf)
   nums letters
0     1       a
1     2       b
   nums letters
2     3       c
3     4       d
create_ddf creates Dask DataFrames with two partitions by default.
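As a rough mental model of what splitting four rows into two partitions means, the sketch below chunks a list of rows into contiguous groups. It is an illustration only: chunk_rows is a hypothetical helper, and neither Dask nor beavis is claimed to divide data exactly this way.

```python
import math


def chunk_rows(rows, npartitions=2):
    """Split rows into at most npartitions contiguous chunks.

    Hypothetical helper for illustration; not Dask's partitioning logic.
    """
    if npartitions < 1:
        raise ValueError("npartitions must be at least 1")
    # Rows per chunk, rounded up so no row is left over.
    size = math.ceil(len(rows) / npartitions) or 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


rows = [[1, "a"], [2, "b"], [3, "c"], [4, "d"]]
partitions = chunk_rows(rows)
for partition in partitions:
    print(partition)
```

With the four rows used above, this yields two chunks of two rows each, which mirrors the two blocks printed by print_partitions.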
Print the human-readable dtypes of the DataFrame.
print(ddf.dtypes)
nums        int64
letters    object
dtype: object
Print the dtypes using code formatting.
beavis.print_dtypes(ddf)
# {'nums': 'int64', 'letters': 'object'}
You can get a high-level overview of how data is distributed in a cluster with partition_report.
beavis.partition_report(ddf)
It'll output a report like this:
Total number of partitions: 1095
Total DataFrame memory: 101.71 kiB
Num partitions < 1 KB: 1095 (100.0%)
Num partitions < 1 MB: 1095 (100.0%)
Num partitions > 1 GB: 0 (0.0%)
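The size buckets in a report like this can be computed from a plain list of per-partition sizes. The sketch below shows one way to do that; it is not the beavis implementation. The function name summarize_partition_sizes is hypothetical, and the thresholds assume 1 KB = 1024 bytes, which may differ from what partition_report actually uses.

```python
def summarize_partition_sizes(sizes_in_bytes):
    """Build report lines that bucket partition sizes by threshold.

    Hypothetical helper for illustration; assumes 1 KB = 1024 bytes.
    """
    total = len(sizes_in_bytes)

    def share(count):
        # Percentage of all partitions, guarding against an empty list.
        return 100.0 * count / total if total else 0.0

    under_1kb = sum(1 for size in sizes_in_bytes if size < 1024)
    under_1mb = sum(1 for size in sizes_in_bytes if size < 1024 ** 2)
    over_1gb = sum(1 for size in sizes_in_bytes if size > 1024 ** 3)
    return [
        f"Total number of partitions: {total}",
        f"Num partitions < 1 KB: {under_1kb} ({share(under_1kb):.1f}%)",
        f"Num partitions < 1 MB: {under_1mb} ({share(under_1mb):.1f}%)",
        f"Num partitions > 1 GB: {over_1gb} ({share(over_1gb):.1f}%)",
    ]


# Three made-up partition sizes, in bytes.
summary = summarize_partition_sizes([100, 200, 2048])
print("\n".join(summary))
```

A report dominated by partitions under 1 KB, like the one shown above, usually signals that the data is spread across far more partitions than it needs.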
Development
Install Poetry and run poetry install to create a virtual environment with all the Beavis dependencies on your machine.
Other useful commands:
- poetry run pytest tests: runs the test suite
- poetry run black .: formats the code
- poetry build: packages the library in a wheel file
- poetry publish: releases the library to PyPI (requires the correct credentials)