Fillna does not work if fields_group is not None

Open LeetaH666 opened this issue 1 year ago • 2 comments

🐛 Bug Description

The Fillna processor has no effect when fields_group is not None. It writes the fill value into df.values, but df.values can return a copy of the data rather than a view, so the DataFrame itself is left unchanged.

To Reproduce

Use any model and specify fields_group for the Fillna processor.
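The failure mode can also be sketched without qlib. The snippet below is only an illustration under assumptions: the toy frame and its column names are hypothetical, and the two dtypes are there to make pandas build a fresh array for df.values. It mimics the "write into df.values" pattern described above rather than calling the real processor.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame. Mixed dtypes mean the columns live in separate
# internal blocks, so df.values has to assemble a new array (a copy).
df = pd.DataFrame({"feat": [1.0, np.nan, 3.0], "count": [1, 2, 3]})

nan_select = np.isnan(df.values)
# This writes into the temporary copy returned by df.values, not into df.
df.values[nan_select] = 0.0

# The NaN in "feat" is still there afterwards.
remaining_nans = int(df["feat"].isna().sum())
print(remaining_nans)
```

Whether df.values is a view or a copy depends on how the frame is stored internally and on the pandas version, so the bug may not show up on every DataFrame.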

Expected Behavior

No NaN values remain in the selected fields after calling Fillna.

Additional Notes

Same as the issue here: https://github.com/microsoft/qlib/issues/1307#issuecomment-1785284039.

LeetaH666, Sep 26 '24 02:09

I think simply using slice assignment would be ok:

    def __call__(self, df):
        cols = get_group_columns(df, self.fields_group)
        df.loc[:, cols] = df.loc[:, cols].fillna(self.fill_value)
        return df

LeetaH666, Sep 26 '24 03:09

Or, if you want to use NumPy for speed (I get about a 10x speedup), assign df.values (or df.to_numpy()) to a variable first, fill the NaNs in that array, then assign it back:

    def __call__(self, df):
        if self.fields_group is None:
            df.fillna(self.fill_value, inplace=True)
        else:
            cols = get_group_columns(df, self.fields_group)
            # this implementation is extremely slow
            # df.fillna({col: self.fill_value for col in cols}, inplace=True)

            #! similar to qlib.data.dataset.processor.Fillna, we use numpy to accelerate
            #! but instead, we assign the numpy array to a variable first
            df_values = df[cols].to_numpy()
            nan_select = np.isnan(df_values)
            #! then fill value and assign back
            df_values[nan_select] = self.fill_value
            df.loc[:, cols] = df_values
        return df
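For a quick check outside qlib, the same idea can be written as a standalone function. This is a sketch under assumptions: fill_columns and the toy column names are hypothetical, cols is passed in directly instead of coming from get_group_columns, and the selected columns are assumed to be numeric.

```python
import numpy as np
import pandas as pd


def fill_columns(df, cols, fill_value=0.0):
    """Fill NaNs in `cols` only, going through a NumPy array."""
    # Take an explicit float copy so the in-place fill below is always allowed.
    values = df[cols].to_numpy(dtype=float, copy=True)
    values[np.isnan(values)] = fill_value
    # Assign the filled array back so the DataFrame itself is updated.
    df.loc[:, cols] = values
    return df


# Hypothetical toy data: "label" is outside the group and must keep its NaN.
df = pd.DataFrame(
    {"a": [1.0, np.nan], "b": [np.nan, 2.0], "label": [np.nan, 1.0]}
)
out = fill_columns(df, ["a", "b"])
print(out)
```

The key point is the final assignment through df.loc: the array is filled in a local variable, and only that last line changes the DataFrame.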

LeetaH666, Sep 26 '24 04:09