neuralforecast
model training got stuck when running the official tutorial example
What happened + What you expected to happen
Hi,
I'm new to Nixtla. When trying to run the example code from the official tutorial on my local machine (Linux, CentOS): https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started_complete.html, I found it got stuck at the nf.fit(df=Y_df) step:
2024-03-21 17:00:29,350 INFO worker.py:1724 -- Started a local Ray instance.
2024-03-21 17:00:29,926 INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2024-03-21 17:00:29,927 INFO tune.py:592 -- [output] This will use the new output engine with verbosity 0. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment _train_tune_2024-03-21_17-00-27 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 5 │
╰────────────────────────────────────────────────────────────────────╯
View detailed results here: /root/ray_results/_train_tune_2024-03-21_17-00-27
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/_train_tune_2024-03-21_17-00-27`
(_train_tune pid=2885517) Seed set to 11
(_train_tune pid=2885517) [rank: 0] Seed set to 11
(_train_tune pid=2885517) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
The process I followed to set up:
- Use conda to create a new environment with Python 3.9.
- Run
pip install statsforecast s3fs datasetsforecast
as in the tutorial example.
- Run
pip install git+https://github.com/Nixtla/neuralforecast.git@main
as in the tutorial example.
- Run
pip install matplotlib
to get the 3rd step of the tutorial to work.
- Change the code in:
nf = NeuralForecast(
    models=[
        AutoNHITS(h=48, config=config_nhits, loss=MQLoss(), num_samples=5),
        AutoLSTM(h=48, config=config_lstm, loss=MQLoss(), num_samples=2),
    ],
    freq='H',
)
from freq='H' to freq=1, because of:
ValueError: Time column contains integers but the specified frequency is not an integer. Please provide a valid integer, e.g. 'freq=1'
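For context, here is a minimal sketch (with made-up toy data, not the tutorial's dataset) of how the dtype of the ds column determines which kind of freq is valid:

```python
import pandas as pd

# Toy frames illustrating the two kinds of time columns
# (assumption: illustrative data only, not the tutorial's dataset).
Y_int = pd.DataFrame({"unique_id": "A", "ds": range(5), "y": [1.0] * 5})
Y_ts = pd.DataFrame({
    "unique_id": "A",
    "ds": pd.date_range("2024-01-01", periods=5, freq="h"),
    "y": [1.0] * 5,
})

def freq_for(df):
    # Integer timestamps require an integer step, e.g. freq=1;
    # datetime timestamps require a pandas offset alias, e.g. freq='H'.
    return 1 if pd.api.types.is_integer_dtype(df["ds"]) else "H"
```

So the ValueError above simply means the dataframe's ds column held integers while freq='H' promised datetimes.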
I was wondering what could have gone wrong in the steps above and why it got stuck during training.
Then I tried the tutorial notebook in Colab. The fit process completes, but evaluation fails at:
evaluation_df = accuracy(cv_df, [mse, mae, rmse], agg_by=['unique_id'])
ValueError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/triad/collections/schema.py in append(self, obj)
359 elif isinstance(obj, pd.DataFrame):
--> 360 self._append_pa_schema(PD_UTILS.to_schema(obj))
361 elif isinstance(obj, Tuple): # type: ignore
11 frames
ValueError: pandas like datafame index can't have name
During handling of the above exception, another exception occurred:
SchemaError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/triad/collections/schema.py in append(self, obj)
370 raise
371 except Exception as e:
--> 372 raise SchemaError(str(e))
373
374 def remove( # noqa: C901
SchemaError: pandas like datafame index can't have name
Looking forward to your reply.
Versions / Dependencies
OS: Linux CentOS
neuralforecast 1.6.4
python 3.9.18
ray 2.9.3
torch 2.2.1
transformers 4.39.0
pandas 2.2.1
Reproduction script
Official tutorial example: https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started_complete.html
Issue Severity
High: It blocks me from completing my task.
Hey @hxuaj, sorry for the troubles.
The first error should be fixed by setting the CUDA_VISIBLE_DEVICES env variable to one of your devices (0 or 1), either through the terminal or in your session with os.environ.
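For example (a minimal sketch; device id "0" is just one valid choice, and the variable must be set before CUDA is initialized):

```python
import os

# Restrict CUDA to a single device; this must run before torch /
# pytorch-lightning initialize CUDA in the session.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

The shell equivalent is `export CUDA_VISIBLE_DEVICES=0` before launching Python.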
The second error I'm guessing refers to the fact that the dataframe has an index. We're deprecating the datasetsforecast losses, so you should do something like this instead:
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mse, mae, rmse
evaluation_df = evaluate(cv_df, [mse, mae, rmse])
Hi @jmoralez, thanks for the quick reply. For the first error: my local machine has 2 GPUs, and it seems to be a bug in PyTorch Lightning: https://github.com/Lightning-AI/pytorch-lightning/issues/4612. I didn't find a proper solution, but as you suggested I can now run the model fit with only one GPU visible as a workaround. For the second error, I changed the code to:
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mse, mae, rmse
cv_df.reset_index(inplace=True)
evaluation_df = evaluate(cv_df, [mse, mae, rmse])
I just reset the index of the df before evaluation. Now it works fine.
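To illustrate the workaround with toy data (an assumption for illustration, not the actual cross-validation output):

```python
import pandas as pd

# cross_validation can return a frame indexed by 'unique_id'; a named
# index is what trips the schema check ("index can't have name").
cv_df = pd.DataFrame(
    {"ds": [1, 2], "NHITS": [10.5, 11.0], "y": [10.0, 12.0]},
    index=pd.Index(["A", "A"], name="unique_id"),
)

cv_df.reset_index(inplace=True)  # 'unique_id' becomes a regular column
```

After the reset, the frame has a plain unnamed RangeIndex, which the evaluation accepts.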
Could you update the relevant parts of the official tutorial? It can be frustrating to encounter such errors in the examples. Thank you again.
Just ran into the same issue, your workaround fixed it, thanks @hxuaj
@jmoralez, BTW, the error message has a typo: "datafame" should be "dataframe" (and the grammar may as well be fixed too: "pandas-like dataframe index can't have name").
BTW, the error has a typo
That's not coming from our libs, feel free to open an issue in the corresponding lib.