buckaroo
Feat: limit / specify number of rows to display
Checks
- [X] I have checked that this enhancement has not already been requested
How would you categorize this request. You can select multiple if not sure
Display (is this related to visual display of a value)
Enhancement Description
As far as I understand, currently Data Frames are displayed in their entirety up to 10k rows, after which they are sampled to 10k rows and displayed.
This request is for an argument to DFViewer, or wherever makes the most sense, that limits the number of rows displayed to some n, where 10k > n > 1.
While I understand that it's possible to call DFViewer(df.head(10)) to display only 10 rows, this also provides summary stats over only those 10 rows. This request is looking for behavior like the following:
DFViewer(df, max_rows=10) # only displays 10 rows, show summary stats over entire / sampled df
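To make the distinction concrete, here is a minimal plain-pandas sketch of the difference between truncating before and after computing stats. It does not use buckaroo at all; preview_with_full_stats is a hypothetical helper name, and df.describe stands in for buckaroo's summary stats only as an illustration.

```python
import pandas as pd


def preview_with_full_stats(df: pd.DataFrame, max_rows: int = 10):
    """Hypothetical helper, not part of buckaroo.

    Returns the rows to display (first max_rows) together with
    summary stats computed over the whole DataFrame.
    """
    rows_to_display = df.head(max_rows)
    summary_stats = df.describe(include="all")  # computed before truncation
    return rows_to_display, summary_stats


df = pd.DataFrame({"a": range(300), "b": [i % 7 for i in range(300)]})

# Requested behavior: 10 displayed rows, stats over all 300 rows.
rows, stats = preview_with_full_stats(df, max_rows=10)

# Current workaround: truncating first means the stats only see 10 rows.
head_only_stats = df.head(10).describe(include="all")

print(len(rows))
print(stats.loc["count", "a"])
print(head_only_stats.loc["count", "a"])
```

The only design point is the order of operations: the stats are taken from df, and the truncation applies solely to what is displayed.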
If this is already possible, my apologies.
Appreciative of this great tool!
Pseudo Code Implementation
NA
Prior Art
NA
Thanks for the interest. BuckarooWidget and PolarsBuckarooWidget have a facility for changing sampling behavior through inheritance. Sampling occurs before summary stats and before serialization. The Python side of serialization is very slow. In the following code I modified the behavior of DFViewer to accept a widget_klass. I also made an implementation of BuckarooWidget that uses a severely restrictive sampling_klass.
Try this code snippet out.
I will definitely modify the DFViewer function to accept a widget_klass in an upcoming release.
I could add an option for configuring sampling behavior, but for now I'd like to wait. You can write your own utility function to build a sampling_klass and assemble a DFViewer as you see fit. What do you think about the ergonomics of one way vs. the other?
import pandas as pd

from buckaroo.buckaroo_widget import RawDFViewerWidget, BuckarooWidget
from buckaroo.dataflow.widget_extension_utils import configure_buckaroo
from buckaroo.dataflow.dataflow_extras import Sampling


def DFViewer(df,
             column_config_overrides=None,
             extra_pinned_rows=None, pinned_rows=None,
             extra_analysis_klasses=None, analysis_klasses=None,
             widget_klass=BuckarooWidget):
    """
    Display a DataFrame with buckaroo styling and analysis, no extra UI pieces

    column_config_overrides allows targeted, specific overriding of styling
    extra_pinned_rows adds pinned_rows of summary stats
    pinned_rows replaces the default pinned rows
    extra_analysis_klasses adds an analysis_klass
    analysis_klasses replaces default analysis_klass
    widget_klass is the widget class to configure (and so its sampling_klass)
    """
    BuckarooKls = configure_buckaroo(
        widget_klass,
        extra_pinned_rows=extra_pinned_rows, pinned_rows=pinned_rows,
        extra_analysis_klasses=extra_analysis_klasses, analysis_klasses=analysis_klasses)

    bw = BuckarooKls(df, column_config_overrides=column_config_overrides)
    dfv_config = bw.df_display_args['dfviewer_special']['df_viewer_config']
    df_data = bw.df_data_dict['main']
    summary_stats_data = bw.df_data_dict['all_stats']
    return RawDFViewerWidget(
        df_data=df_data, df_viewer_config=dfv_config, summary_stats_data=summary_stats_data)


df = pd.DataFrame({'a': [10, 20, 339, 887], 'b': ['foo', 'bar', None, 'baz']})
#DFViewer(df)


# A severely restrictive sampling class for demonstration
class TwoSample(Sampling):
    pre_limit = 5
    max_columns = 1
    serialize_limit = 2


class TwoBuckaroo(BuckarooWidget):
    sampling_klass = TwoSample


DFViewer(df, widget_klass=TwoBuckaroo)
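The utility function for building a sampling_klass mentioned above could look roughly like the sketch below. make_sampling_klass is a hypothetical name, not part of buckaroo. The sketch only sets the three class attributes shown in TwoSample and makes no claim about how buckaroo interprets them. The fallback Sampling class is a stand-in so the snippet runs even where buckaroo is not installed.

```python
try:
    from buckaroo.dataflow.dataflow_extras import Sampling
except ImportError:
    # Stand-in base class (assumption) so the sketch runs without buckaroo.
    class Sampling:
        pre_limit = None
        max_columns = None
        serialize_limit = None


def make_sampling_klass(pre_limit=None, serialize_limit=None, max_columns=None):
    """Hypothetical helper: build a Sampling subclass with the given limits.

    Any limit left as None is inherited from the Sampling base class.
    """
    overrides = {
        "pre_limit": pre_limit,
        "serialize_limit": serialize_limit,
        "max_columns": max_columns,
    }
    attrs = {name: value for name, value in overrides.items() if value is not None}
    return type("CustomSampling", (Sampling,), attrs)


# Equivalent to the hand-written TwoSample class.
TwoSampleAgain = make_sampling_klass(pre_limit=5, serialize_limit=2, max_columns=1)

# Only override the serialization limit; other limits come from Sampling.
TenRowSample = make_sampling_klass(serialize_limit=10)
```

A class built this way would be attached the same way as TwoSample: subclass BuckarooWidget, set sampling_klass to the generated class, and pass that subclass as widget_klass.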
Appreciate the speedy response!
I did try out the code snippet you shared, and while it looked promising, I wasn't able to produce the behavior I was looking for. Playing with pre_limit and serialize_limit did limit the number of displayed rows, but it also altered the behavior of the sampling. In my test case, I have a dataframe with 300 rows, and what I'd like to see is summary stats across the entire dataframe while showing only the top 5 and bottom 5 rows (by index), akin to default pandas behavior.
Just to clarify, I love the current logic of the default dataframe view after import buckaroo. What I'm looking for is to keep that wonderful logic but display fewer rows, or a configurable number of rows. Something akin to pandas's pd.options.display.max_rows.
Ex:
import pandas as pd
df = pd.DataFrame({"a": range(300), "b": ["c" * i for i in range(300)]})
df
shows
import polars as pl
pl.from_dataframe(df)
shows
import buckaroo
df
shows all 300 rows, with summary stats over all 300 rows. The desired behavior is to show only the top 5 and bottom 5 rows, with summary stats over all 300 rows.
Other than the ellipsis row this should do what you want. I'd need to think a bit about how to accommodate an ellipsis row. You could just do values, but really you want a row with different styling, which requires a separate release for frontend mods.
As for customizing the default display behavior: I love that you want to do this. It's exactly how I want people to use Buckaroo: customize it with their own opinions and make it do the thing they want by default.
There are a couple of ways to get the behavior that you want, all of which will require some dev work on my end.
- Customize the implementation of buckaroo.widget_utils.enable. This should accept tuples of (BuckarooKls, dataframeType). Then you could have a one-liner that calls enable with your own customized widget. That will work for pandas; it's harder for polars and geopandas, since I have done a bunch of work to keep those dependencies optional.
- Use some type of customization framework so you could have a .buckaroo config file.
Why don't you work on some of the customizations available now, and we'll look at these options in future releases.
BTW, if you're up for it, I'd love to talk to you about how you're using Buckaroo. Contact me offline; my info is available in my GitHub profile.
Thank you so much! I'll definitely follow up with you on this!
How has this solution been working for you?