[feat] add utility function to fill missing values in target (y)
Description
Similar to how fill_gaps adds missing timestamps, a utility function to fill missing values would be nice to have as well.
Use case
After a user has used fill_gaps, there are missing target values that need to be filled, else models like TimeGPT will not work. Adding this utility will help users so they do not have to repeat boilerplate code. It will also improve the UX.
How is that boilerplate? Users may want to ffill, bfill, interpolate, use some kind of statistic, etc. Are you suggesting we implement all of that?
Yes, we can implement at least the most common ones - such as ffill, bfill, interpolate.
Ok, suppose I want to bfill my target and ffill my exogs. I currently do this:
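from utilsforecast.data import generate_series
from utilsforecast.preprocessing import fill_gaps

# Build 5 synthetic series with two static features, then drop half the rows to create gaps
series = generate_series(5, n_static_features=2)
with_gaps = series.sample(frac=0.5)
# Re-insert the missing timestamps; the new rows hold NaN outside the id and time columns
filled = fill_gaps(with_gaps, freq='D')
# Backfill the target and forward-fill the exogenous columns, series by series
by_serie = filled.groupby('unique_id', observed=True)
filled['y'] = by_serie['y'].bfill()
exogs = ['static_0', 'static_1']
filled[exogs] = by_serie[exogs].ffill()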
Please show me the UX improvements.
We need to support many more filling options than just ffill and bfill, and we need to support polars in addition to pandas. Once implemented, it will be used in conjunction with clean_data. Having all of this in a convenient wrapper function will make the workflow more streamlined. If we are already implementing it in Nixtla, why not expose it through this utils library so that users can benefit from it as well?
I want you to provide an example showing that the effort is worth it.
I will implement it in the nixtla repo for audit data. You can have a look over there and we can decide then if we need to move it here.
I don't mean the code, I mean the API. Will we write 200 lines to save users 1? Or how will it work?
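For reference, one possible shape for the API under discussion, as a minimal pandas-only sketch. The name `fill_missing` and its `strategies` mapping are hypothetical and are not part of utilsforecast; polars support and statistic-based fills are left out.

```python
import numpy as np
import pandas as pd


def fill_missing(df, strategies, id_col="unique_id"):
    """Hypothetical helper: fill NaNs per series, one strategy per column."""
    out = df.copy()
    grouped = df.groupby(id_col, observed=True)
    for col, how in strategies.items():
        if how == "ffill":
            out[col] = grouped[col].ffill()
        elif how == "bfill":
            out[col] = grouped[col].bfill()
        elif how == "interpolate":
            # Linear interpolation within each series; leading NaNs stay NaN.
            out[col] = grouped[col].transform(lambda s: s.interpolate())
        else:
            raise ValueError(f"unknown strategy: {how!r}")
    return out


# Toy frame standing in for the output of fill_gaps.
df = pd.DataFrame({
    "unique_id": ["a", "a", "a", "b", "b", "b"],
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "y": [np.nan, 2.0, 3.0, 1.0, np.nan, 3.0],
    "static_0": [7.0, np.nan, np.nan, 9.0, np.nan, np.nan],
})
filled = fill_missing(df, {"y": "bfill", "static_0": "ffill"})
```

With this shape, the bfill-target and ffill-exogs case above becomes a single call. Whether that saving justifies the extra code and maintenance is the open question in this thread.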
Hi @jmoralez , @ngupta23 ,
Just my thoughts here: what if you limited exog data to static vars? That makes the fill strategy slightly easier, in my mind at least, unless I am missing a use case where they are not just duplicated from available data.
The function call would look like the one below, with a fill strategy of 0, mean, median, or interpolation.
Static vars are just remerged from unique available data (since they are static) or a cartesian product.
This is super helpful when you have hierarchical data; it's always a bit clunky to fill gaps and reproduce the missing hierarchy info.
df_padded = fill_gaps(
    df,
    id_col="unique_id",
    time_col="date",
    target_col="sales",
    freq="D",
    fill_strategy="ffill",
    static_cols=["dept_id", "store_id"],
)
Again just my thoughts, but I agree on the usefulness of being able to handle this in a single call.
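The static-column re-merge described above could be sketched roughly as follows. `restore_static` is a hypothetical name, not an existing utilsforecast function, and the sketch assumes pandas input where every observed row carries complete static values.

```python
import numpy as np
import pandas as pd


def restore_static(padded, original, static_cols, id_col="unique_id"):
    """Hypothetical helper: re-attach per-series static columns after padding."""
    # One row per series, taken from the rows that were actually observed.
    lookup = original[[id_col, *static_cols]].drop_duplicates(subset=id_col)
    # Drop the partially-NaN static columns and merge the complete values back in.
    return padded.drop(columns=static_cols).merge(lookup, on=id_col, how="left")


original = pd.DataFrame({
    "unique_id": ["a", "a", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02"]),
    "sales": [5.0, 7.0, 4.0],
    "dept_id": ["d1", "d1", "d2"],
    "store_id": ["s1", "s1", "s9"],
})
# What gap filling produces: the new row has NaN everywhere except the keys.
padded = pd.DataFrame({
    "unique_id": ["a", "a", "a", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-02"]),
    "sales": [5.0, np.nan, 7.0, 4.0],
    "dept_id": ["d1", None, "d1", "d2"],
    "store_id": ["s1", None, "s1", "s9"],
})
restored = restore_static(padded, original, ["dept_id", "store_id"])
```

Unlike a forward fill, a merge also covers padded rows that come before a series' first observation, which can matter when every series is padded to a common start date.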