scikit-lego
scikit-lego copied to clipboard
[FEATURE] allow for all kwargs when using @log_step
Hi,
When using @log_step in debugging a Pandas Pipeline, the current function must accept a single argument of df:pd.Dataframe.
However if the user sends all the parameters as kwargs there is an error .
It would be useful if the @log_step will check the first kwargs and if it is a pd.Dataframe then it will convert it into an arg - possible implementation before running the def wrapper() as follows
_kwargs = {**kwargs}
first_arg= next(iter(_kwargs))
if isinstance(_kwargs[first_arg],pd.DataFrame) and len(args)==0:
args=args+(_kwargs.pop[first_arg],)
I took a brief look, but the only thing I could imagine failing is the shape_delta when its set to true. Is that indeed what's happening?
yes
this currently is what i've implements
if shape_delta:
if len(args)==0 and len(kwargs)>0:
_kwargs = {**kwargs}
first_arg= next(iter(_kwargs))
if isinstance(_kwargs[first_arg],pd.DataFrame):
args=args+(_kwargs.pop[first_arg],)
old_shape = args[0].shape
tic = dt.datetime.now()
If you think this is a good direction I'll be happy to submit a PR
I'm not sure if changing args directly is a great way to solve this because we use it later in result = func(*args, **kwargs)
Maybe something like this?
if shape_delta:
if len(args) > 0:
old_shape = args[0].shape
else:
for v in kwargs.values:
if isinstance(v, pd.DataFrame):
old_shape = v.shape
break
else:
raise ValueError("shape_delta was set to true, but no DataFrame instance was found as either the first non-keyword argument or in the keyword arguments")
@MBrouns should I provide a PR?
It may be an overkill, but a way to do it is to allow to specify the index or name of the argument that is the dataframe to track and then use inspect.signature.
Example assuming the new argument is called arg_to_track:
if shape_delta:
func_args = (
inspect.signature(func).bind(*args, **kwargs).arguments # type: ignore
)
if isinstance(arg_to_track, int) and arg_to_track >= 0:
_, input_frame = tuple(func_args.items())[arg_to_track]
elif isinstance(arg_to_track, str):
input_frame = func_args[arg_to_track]
else:
raise ValueError("arg_to_track should be a string or a positive integer")
old_shape = input_frame .shape
...
result = func(*args, **kwargs)
...
Remark that both getters could result in errors if
- int is out of range
- string is not an argument name
Edit: commenting and expanding on this approach From one side this implementation would be as flexible as possible without any misinterpretation (*); on the other hand, it moves responsibility to the user.
(*) The snippet
for v in kwargs.values:
if isinstance(v, pd.DataFrame):
old_shape = v.shape
break
is assuming the the first dataframe passed is the one we want to track, but it could be terribly wrong if a function takes more than one dataframe as input (e.g. if one wants to do a merge inside it)