nixtla icon indicating copy to clipboard operation
nixtla copied to clipboard

[bug] error message is not correct

Open ngupta23 opened this issue 1 year ago • 10 comments

What happened + What you expected to happen

When passing a missing column name to id_col or time_col, the error message is either not correct or could be improved.

Versions / Dependencies

Click to expand Dependencies: nixtla 0.6.4

Reproducible example

ID Column is missing

import pandas as pd
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(api_key="")
nixtla_client.validate_api_key()

# Read the data
df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
)

# ID column is incorrect
forecast_df = nixtla_client.forecast(df=df, h=24, freq='h', id_col="Sth")
ValueError: Series contain missing or duplicate timestamps, or the timestamps do not match the provided frequency.
Please make sure that all series have a single observation from the first to the last timestamp and that the provided frequency matches the timestamps'.
You can refer to https://docs.nixtla.io/docs/tutorials-missing_values for an end to end example.

In this case, the error message is incorrect. There should be some indication that the id_col is incorrect.

Time Column is missing

import pandas as pd
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(api_key="")
nixtla_client.validate_api_key()

# Read the data
df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
)

# Time column is incorrect
forecast_df = nixtla_client.forecast(df=df, h=24, freq='h', time_col="Sth")
KeyError: 'Sth'

In this case while the error is somewhat correct, it would be a nicer UX to catch this error and relay the error in a better way to the user (indicating that the time_col is missing)

Issue Severity

Low: It annoys or frustrates me.

ngupta23 avatar Dec 12 '24 22:12 ngupta23

It is possible that we are not checking if that id_col exists or not. This is possibly due to the fact that the default value of id_col is unique_id and it need not be present in the data as in the air passenger dataset even though the argument value passed is unique_id.

In my opinion, it would be better to make the default value of the id_col argument = None and let the user specify it if it exists in the data. Then we should add a check to make sure that if a user specifies this (not None), then it should exist in the dataset before we proceed further.

ngupta23 avatar Dec 16 '24 18:12 ngupta23

This is because we support having no ID and ID as the index, so we have no idea where the ID is and we try to guess, which fills me with joy.

Changing the default would mean breaking the correct workflow, which is having the ID as a column and providing its name.

jmoralez avatar Dec 16 '24 18:12 jmoralez

This is because we support having no ID and ID as the index, so we have no idea where the ID is and we try to guess, which fills me with joy.

+1 on voting for deprecating the support for Pandas indices.

elephaint avatar Dec 16 '24 19:12 elephaint

Never mind, what can be in the index is the timestamps. So the only problem is that the data can have no ID and in that case we assign a constant id, so if the data does have an ID but the user doesn't set that as id_col there's no way for us to know.

jmoralez avatar Dec 16 '24 19:12 jmoralez

Never mind, what can be in the index is the timestamps. So the only problem is that the data can have no ID and in that case we assign a constant id, so if the data does have an ID but the user doesn't set that as id_col there's no way for us to know.

Can we do this - Change the default for id_col to None and then

  1. If the data does not have a ID column and the user leaves the id_col=None, then we treat this as a single time series.
  2. If the data has an ID column and the user specifies this column name in the id_col argument, then we check for it and flag an error if it does not exist.
  3. If the user has an ID column but does not specify it in id_col, it is treated as an exogenous column and will flag error if it is categorical.

Do you see any issues with it?

ngupta23 avatar Dec 16 '24 19:12 ngupta23

How do we know if we're in number 3?

jmoralez avatar Dec 16 '24 20:12 jmoralez

Changing the default to None would be a breaking change to everyone who follows the standard we defined in the other libraries (which this library decided not to follow).

What I would support would be allowing the id_col to be None and require users to set it to None if they don't have it and they're providing a single serie, which would also be a breaking change.

jmoralez avatar Dec 16 '24 20:12 jmoralez

3. If the user has an ID column but does not specify it in id_col

id_col would be None in this case (not specified by the user). i.e. the user has to explicitly specify which ID col they want to use (if any).

Changing the default to None would be a breaking change to everyone who follows the standard we defined in the other libraries (which this library decided not to follow).

Could you clarify which standard this is and what other libraries are we referring to here?

What I would support would be allowing the id_col to be None and require users to set it to None if they don't have it and they're providing a single serie, which would also be a breaking change.

This would be an acceptable compromise if changing the default to None is not acceptable.

ngupta23 avatar Dec 17 '24 17:12 ngupta23

https://nixtlaverse.nixtla.io/statsforecast/docs/getting-started/getting_started_short.html

The input to StatsForecast is always a data frame in long format with three columns: unique_id, ds and y

https://nixtlaverse.nixtla.io/mlforecast/docs/getting-started/quick_start_local.html#data-format

The data is expected to be a pandas dataframe in long format, that is, each row represents an observation of a single serie at a given time, with at least three columns: id_col: column that identifies each serie. target_col: column that has the series values at each timestamp. time_col: column that contains the time the series value was observed. These are usually timestamps, but can also be consecutive integers.

https://nixtlaverse.nixtla.io/neuralforecast/docs/getting-started/quickstart.html#2-loading-airpassengers-data

The core.NeuralForecast class contains shared, fit, predict and other methods that take as inputs pandas DataFrames with columns ['unique_id', 'ds', 'y']

jmoralez avatar Dec 17 '24 17:12 jmoralez

Thanks @jmoralez - In that case, your suggestion of not changing the default but accepting the value of None would be good to implement. We should also add checks in that case if the column specified by the user exists or not.

ngupta23 avatar Dec 17 '24 18:12 ngupta23