subgrounds
Updated dataframe interactions
Is your feature request related to a problem? Please describe. I cannot use modern dataframe techniques effectively with subgrounds.
Describe the solution you'd like
I would like to leverage modern pandas (2.0) and polars, alongside the arrow data format and duckdb, directly when using subgrounds: `query_arrow`, or even `query(format="pandas")`, perhaps a generic query interface similar to how `PaginationStrategy` is decided. Theoretically, we could add a new argument to `query` and codify the existing `query` function with a default `legacy_query` callable or interface.
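As a rough illustration of what such a generic dispatch could look like (a sketch only: the `FORMATTERS` registry, the `format` argument, and the placeholder formatters below are hypothetical, not existing subgrounds API):

```python
from typing import Any, Callable

# hypothetical registry mapping a format name to a result formatter;
# the "legacy" entry stands in for the current query behaviour
FORMATTERS: dict[str, Callable[[list[dict[str, Any]]], Any]] = {
    "legacy": lambda rows: rows,                        # current list-of-dicts
    "pandas": lambda rows: ("pandas.DataFrame", rows),  # placeholder
    "arrow": lambda rows: ("pyarrow.Table", rows),      # placeholder
}


def query(rows: list[dict[str, Any]], format: str = "legacy") -> Any:
    """Dispatch raw query results to the requested output format."""
    try:
        formatter = FORMATTERS[format]
    except KeyError:
        raise ValueError(f"unknown format: {format!r}")
    return formatter(rows)
```

Defaulting `format` to the legacy behaviour is what would keep existing callers working unchanged.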
Describe alternatives you've considered
You can use `query_json` to implement a custom `query`, but it's undocumented and quite obtuse. It's also quite awkward to navigate the python-ification of data types, which often has to be undone with polars, for example.
Additional context
- The interface for this should be future-proof, as we'll likely want to push it over `query_df` as we currently do. Theoretically, `query_df` could be switched to this new `query` interface as a shorthand to maintain backwards compatibility.
- In a theoretical subgrounds 2.0, breaking changes could elevate this interface further.
- The arrow data format is an interesting standardization that could be rooted deeper in this interface. Something like `query_arrow` could easily be converted to a `pandas>=2.0` or `polars` dataframe without any conversion loss.
- Likely, we'll want to push an alpha build of this interface for testers
Implementation checklist
- [ ] Task 1
Just opened an issue that is similar to this - https://github.com/0xPlaygrounds/subgrounds/issues/42
Closing out issue 42 and adding the comment to this issue to consolidate the discussion:
Subgrounds should offer dataframe GraphQL function support for multiple libraries, not just pandas. Currently the only dataframe utility functions are pandas-based, found here.
The current direction of Subgrounds is towards a multi-client world. One alternative to the base client would be one that utilizes polars instead of pandas dataframes. However, dataframe_utils.py currently only offers pandas helpers, which actively discriminates against using polars with Subgrounds.
To use subgrounds with polars, examples of functions that constantly need to be redefined are `fmt_dict_cols` and `fmt_arr_cols`:
- `fmt_dict_cols` - required to convert GraphQL JSON data into polars dataframe columns
- `fmt_arr_cols` - required to separate GraphQL JSON data fields that contain arrays into individual polars dataframe columns
Example code:
```python
import polars as pl


def fmt_dict_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats dictionary columns, which are structs in a polars df, into
    separate columns and renames them accordingly.
    """
    for column in df.columns:
        if isinstance(df[column][0], dict):
            col_names = df[column][0].keys()
            # rename the struct fields so the unnested columns are prefixed
            struct_df = df.select(
                pl.col(column).struct.rename_fields(
                    [f"{column}_{c}" for c in col_names]
                )
            )
            # unnest the struct into one column per field
            struct_df = struct_df.unnest(column)
            # replace the struct column with the unnested columns
            df = df.drop(column).hstack(struct_df)
    return df


def fmt_arr_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats list columns in a polars df into separate columns and renames
    them accordingly. Since there isn't a direct way to convert an array
    into new columns, we convert the array to a struct and then unnest
    the struct into new columns.
    """
    for column in df.columns:
        # list-column rows show up as pl.Series
        if isinstance(df[column][0], pl.Series):
            n_fields = len(df[column][0])
            # convert the list to a struct (polars >= 0.18 uses the
            # `.list` namespace; older versions used `.arr`)
            struct_df = df.select(pl.col(column).list.to_struct())
            # rename the struct fields: <column>_0, <column>_1, ...
            struct_df = struct_df.select(
                pl.col(column).struct.rename_fields(
                    [f"{column}_{i}" for i in range(n_fields)]
                )
            )
            # unnest the struct fields into their own columns
            struct_df = struct_df.unnest(column)
            # replace the list column with the unnested columns
            df = df.drop(column).hstack(struct_df)
    return df
```