[bug] error message is not correct
What happened + What you expected to happen
When passing a missing column name to id_col or time_col, the error message is either not correct or could be improved.
Versions / Dependencies
Click to expand
Dependencies: nixtla 0.6.4Reproducible example
ID Column is missing
import pandas as pd
from nixtla import NixtlaClient
nixtla_client = NixtlaClient(api_key="")
nixtla_client.validate_api_key()
# Read the data
df = pd.read_csv(
'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
)
# ID column is incorrect
forecast_df = nixtla_client.forecast(df=df, h=24, freq='h', id_col="Sth")
ValueError: Series contain missing or duplicate timestamps, or the timestamps do not match the provided frequency.
Please make sure that all series have a single observation from the first to the last timestamp and that the provided frequency matches the timestamps'.
You can refer to https://docs.nixtla.io/docs/tutorials-missing_values for an end to end example.
In this case, the error message is incorrect. There should be some indication that the id_col is incorrect.
Time Column is missing
import pandas as pd
from nixtla import NixtlaClient
nixtla_client = NixtlaClient(api_key="")
nixtla_client.validate_api_key()
# Read the data
df = pd.read_csv(
'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
)
# Time column is incorrect
forecast_df = nixtla_client.forecast(df=df, h=24, freq='h', time_col="Sth")
KeyError: 'Sth'
In this case while the error is somewhat correct, it would be a nicer UX to catch this error and relay the error in a better way to the user (indicating that the time_col is missing)
Issue Severity
Low: It annoys or frustrates me.
It is possible that we are not checking if that id_col exists or not. This is possibly due to the fact that the default value of id_col is unique_id and it need not be present in the data as in the air passenger dataset even though the argument value passed is unique_id.
In my opinion, it would be better to make the default value of the id_col argument = None and let the user specify it if it exists in the data. Then we should add a check to make sure that if a user specifies this (not None), then it should exist in the dataset before we proceed further.
This is because we support having no ID and ID as the index, so we have no idea where the ID is and we try to guess, which fills me with joy.
Changing the default would mean breaking the correct workflow, which is having the ID as a column and providing its name.
This is because we support having no ID and ID as the index, so we have no idea where the ID is and we try to guess, which fills me with joy.
+1 on voting for deprecating the support for Pandas indices.
Never mind, what can be in the index is the timestamps. So the only problem is that the data can have no ID and in that case we assign a constant id, so if the data does have an ID but the user doesn't set that as id_col there's no way for us to know.
Never mind, what can be in the index is the timestamps. So the only problem is that the data can have no ID and in that case we assign a constant id, so if the data does have an ID but the user doesn't set that as
id_colthere's no way for us to know.
Can we do this - Change the default for id_col to None and then
- If the data does not have a ID column and the user leaves the
id_col=None, then we treat this as a single time series. - If the data has an ID column and the user specifies this column name in the
id_colargument, then we check for it and flag an error if it does not exist. - If the user has an ID column but does not specify it in
id_col, it is treated as an exogenous column and will flag error if it is categorical.
Do you see any issues with it?
How do we know if we're in number 3?
Changing the default to None would be a breaking change to everyone who follows the standard we defined in the other libraries (which this library decided not to follow).
What I would support would be allowing the id_col to be None and require users to set it to None if they don't have it and they're providing a single serie, which would also be a breaking change.
3. If the user has an ID column but does not specify it in
id_col
id_col would be None in this case (not specified by the user). i.e. the user has to explicitly specify which ID col they want to use (if any).
Changing the default to None would be a breaking change to everyone who follows the standard we defined in the other libraries (which this library decided not to follow).
Could you clarify which standard this is and what other libraries are we referring to here?
What I would support would be allowing the id_col to be None and require users to set it to None if they don't have it and they're providing a single serie, which would also be a breaking change.
This would be an acceptable compromise if changing the default to None is not acceptable.
https://nixtlaverse.nixtla.io/statsforecast/docs/getting-started/getting_started_short.html
The input to StatsForecast is always a data frame in long format with three columns: unique_id, ds and y
https://nixtlaverse.nixtla.io/mlforecast/docs/getting-started/quick_start_local.html#data-format
The data is expected to be a pandas dataframe in long format, that is, each row represents an observation of a single serie at a given time, with at least three columns: id_col: column that identifies each serie. target_col: column that has the series values at each timestamp. time_col: column that contains the time the series value was observed. These are usually timestamps, but can also be consecutive integers.
https://nixtlaverse.nixtla.io/neuralforecast/docs/getting-started/quickstart.html#2-loading-airpassengers-data
The core.NeuralForecast class contains shared, fit, predict and other methods that take as inputs pandas DataFrames with columns ['unique_id', 'ds', 'y']
Thanks @jmoralez - In that case, your suggestion of not changing the default but accepting the value of None would be good to implement. We should also add checks in that case if the column specified by the user exists or not.