utilsforecast

[feat] add utility function to fill missing values in target (y)

Open ngupta23 opened this issue 11 months ago • 8 comments

Description

Similar to how fill_gaps adds missing time stamps, providing a utility function to fill missing values would be nice to have as well.

Use case

After a user has used fill_gaps, there are missing target values that need to be filled, else models like TimeGPT will not work. Adding this utility will help users so they do not have to repeat boilerplate code. It will also improve the UX.

ngupta23 avatar Mar 03 '25 20:03 ngupta23

How is that boilerplate? Users may want to ffill, bfill, interpolate, use some kind of statistic, etc. Are you suggesting we implement all of that?

jmoralez avatar Mar 18 '25 16:03 jmoralez

Yes, we can implement at least the most common ones - such as ffill, bfill, interpolate.

ngupta23 avatar Mar 18 '25 16:03 ngupta23

Ok, suppose I want to bfill my target and ffill my exogs. I currently do this:

from utilsforecast.data import generate_series
from utilsforecast.preprocessing import fill_gaps

series = generate_series(5, n_static_features=2)
with_gaps = series.sample(frac=0.5)

filled = fill_gaps(with_gaps, freq='D')
by_serie = filled.groupby('unique_id', observed=True)
filled['y'] = by_serie['y'].bfill()
exogs = ['static_0', 'static_1']
filled[exogs] = by_serie[exogs].ffill()

Please show me the UX improvements.

jmoralez avatar Mar 18 '25 17:03 jmoralez

We need to support many more filling options than just ffill and bfill, and we need to support polars in addition to pandas. Once implemented, it will be used in conjunction with clean_data. Having all of this in a convenient wrapper function will make the workflow more streamlined. If we are already implementing it in Nixtla, why not expose it through this utils library so that users can use it as well?

ngupta23 avatar Mar 26 '25 12:03 ngupta23

I want you to provide an example showing that the effort is worth it.

jmoralez avatar Mar 26 '25 17:03 jmoralez

I will implement it in the nixtla repo for audit data. You can have a look over there and we can decide then if we need to move it here.

ngupta23 avatar Mar 26 '25 18:03 ngupta23

I don't mean the code, I mean the API. Will we write 200 lines to save users 1? Or how will it work?

jmoralez avatar Mar 26 '25 19:03 jmoralez

Hi @jmoralez, @ngupta23,

Just my thoughts here: what if you limited exog data to static vars? That makes the fill strategy slightly easier, in my mind at least, unless I am missing a use case where they are not just duplicated from available data.

The function call would look like the one below, with fill strategies such as 0, mean, median, or interpolation.

Static vars are just remerged from unique available data (since they are static) or a cartesian product.

This is super helpful when you have hierarchical data; it's always a bit clunky to fill gaps and reproduce the missing hierarchy info.

df_padded = fill_gaps(
    df,
    id_col="unique_id",
    time_col="date",
    target_col="sales",
    freq="D",
    fill_strategy="ffill",
    static_cols=["dept_id", "store_id"],
)
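
For illustration only, here is a minimal pandas-only sketch of what a single call like this might do. The function name fill_gaps_and_values, the "zero" strategy label, and the behavior of the fill_strategy and static_cols parameters are all assumptions made for this example; none of it is the existing utilsforecast API. The sketch pads each series only between its own first and last timestamps, and it assumes timestamps are unique within a series and static columns are constant and non-null per series.

```python
import pandas as pd


def fill_gaps_and_values(
    df,
    freq,
    id_col="unique_id",
    time_col="ds",
    target_col="y",
    fill_strategy="ffill",
    static_cols=(),
):
    """Hypothetical helper (not part of utilsforecast).

    Adds the missing timestamps of every series, fills the target with the
    chosen strategy, and copies the static columns onto the new rows.
    """
    pieces = []
    for uid, grp in df.groupby(id_col, observed=True):
        # Complete range between this series' first and last observed timestamp.
        full_index = pd.date_range(grp[time_col].min(), grp[time_col].max(), freq=freq)
        y = grp.set_index(time_col)[target_col].reindex(full_index)
        if fill_strategy == "ffill":
            y = y.ffill()
        elif fill_strategy == "bfill":
            y = y.bfill()
        elif fill_strategy == "interpolate":
            y = y.interpolate()
        elif fill_strategy == "zero":
            y = y.fillna(0)
        elif fill_strategy in ("mean", "median"):
            # Statistic of the observed values of this series (NaNs are skipped).
            y = y.fillna(getattr(y, fill_strategy)())
        else:
            raise ValueError(f"Unknown fill_strategy: {fill_strategy!r}")
        piece = pd.DataFrame({id_col: uid, time_col: full_index, target_col: y.to_numpy()})
        # Static columns are constant per series, so reuse the first observed value.
        for col in static_cols:
            piece[col] = grp[col].iloc[0]
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)


# Toy data: series "a" is missing Jan 3 and Jan 4, series "b" is missing Jan 2.
df = pd.DataFrame(
    {
        "unique_id": ["a", "a", "a", "b", "b"],
        "date": pd.to_datetime(
            ["2025-01-01", "2025-01-02", "2025-01-05", "2025-01-01", "2025-01-03"]
        ),
        "sales": [1.0, 2.0, 5.0, 10.0, 30.0],
        "store_id": ["s1", "s1", "s1", "s2", "s2"],
    }
)
padded = fill_gaps_and_values(
    df,
    freq="D",
    time_col="date",
    target_col="sales",
    fill_strategy="interpolate",
    static_cols=["store_id"],
)
print(padded)
```

Even this pandas-only version needs a branch per strategy, and a real implementation would also have to cover polars and non-static exogenous columns, which is the API-size trade-off raised earlier in the thread.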

Again just my thoughts, but I agree on the usefulness of being able to handle this in a single call.

tackes avatar Sep 09 '25 11:09 tackes