scikit-lego [FEATURE] allow for all kwargs when using @log

Hi,

When using @log_step in debugging a Pandas Pipeline, the current function must accept a single argument of df:pd.Dataframe.

However if the user sends all the parameters as kwargs there is an error .

It would be useful if the @log_step will check the first kwargs and if it is a pd.Dataframe then it will convert it into an arg - possible implementation before running the def wrapper() as follows

    _kwargs = {**kwargs}
    first_arg= next(iter(_kwargs))
    if isinstance(_kwargs[first_arg],pd.DataFrame) and len(args)==0:
        args=args+(_kwargs.pop[first_arg],)

Dec 18 '22 12:12 sephib

I took a brief look, but the only thing I could imagine failing is the shape_delta when its set to true. Is that indeed what's happening?

Dec 18 '22 13:12 MBrouns

yes

Dec 18 '22 13:12 sephib

this currently is what i've implements

        if shape_delta:
            if len(args)==0 and len(kwargs)>0:
                _kwargs = {**kwargs}
                first_arg= next(iter(_kwargs))
                if isinstance(_kwargs[first_arg],pd.DataFrame):
                    args=args+(_kwargs.pop[first_arg],)
            old_shape = args[0].shape
        tic = dt.datetime.now()

Dec 18 '22 13:12 sephib

If you think this is a good direction I'll be happy to submit a PR

Dec 19 '22 06:12 sephib

I'm not sure if changing args directly is a great way to solve this because we use it later in result = func(*args, **kwargs)

Maybe something like this?

if shape_delta:
    if len(args) > 0:
        old_shape = args[0].shape
    else: 
        for v in kwargs.values:
              if isinstance(v, pd.DataFrame):
                  old_shape = v.shape
                  break
         else:
               raise ValueError("shape_delta was set to true, but no DataFrame instance was found as either the first non-keyword argument or in the keyword arguments")

Dec 19 '22 08:12 MBrouns

@MBrouns should I provide a PR?

Dec 25 '22 15:12 sephib

It may be an overkill, but a way to do it is to allow to specify the index or name of the argument that is the dataframe to track and then use inspect.signature.

Example assuming the new argument is called arg_to_track:

if shape_delta:
    func_args = (
        inspect.signature(func).bind(*args, **kwargs).arguments  # type: ignore
    )
    if isinstance(arg_to_track, int) and arg_to_track >= 0:
        _, input_frame = tuple(func_args.items())[arg_to_track]
    elif isinstance(arg_to_track, str):
        input_frame = func_args[arg_to_track]
    else:
        raise ValueError("arg_to_track should be a string or a positive integer")
    
    old_shape = input_frame .shape

...
result = func(*args, **kwargs)
...

Remark that both getters could result in errors if

int is out of range
string is not an argument name

Edit: commenting and expanding on this approach From one side this implementation would be as flexible as possible without any misinterpretation (*); on the other hand, it moves responsibility to the user.

(*) The snippet

for v in kwargs.values:
    if isinstance(v, pd.DataFrame):
        old_shape = v.shape
        break

is assuming the the first dataframe passed is the one we want to track, but it could be terribly wrong if a function takes more than one dataframe as input (e.g. if one wants to do a merge inside it)

Sep 25 '23 07:09 FBruzzesi

scikit-lego
scikit-lego copied to clipboard

[FEATURE] allow for all kwargs when using @log_step

scikit-lego scikit-lego copied to clipboard

[FEATURE] allow for all kwargs when using @log_step

scikit-lego
scikit-lego copied to clipboard