
[BUG] A bug while fine-tuning the model by iteratively training and evaluating using a sliding time window

Open hk63560892 opened this issue 1 year ago • 3 comments

Bug description

I found that there are no labels in valid.parquet.

Steps/Code to reproduce bug

While I'm running this code:

    import glob
    import os

    # `trainer`, `OUTPUT_DIR`, and `wipe_memory` are assumed to be defined in earlier notebook cells.
    start_time_window_index = 1
    final_time_window_index = 4
    for time_index in range(start_time_window_index, final_time_window_index):
        # Set data
        time_index_train = time_index
        time_index_eval = time_index + 1
        train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
        eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
        # Train on day related to time_index
        print('*'*20)
        print("Launch training for day %s are:" %time_index)
        print('*'*20 + '\n')
        trainer.train_dataset_or_path = train_paths
        trainer.reset_lr_scheduler()
        trainer.train()
        trainer.state.global_step +=1
        # Evaluate on the following day
        trainer.eval_dataset_or_path = eval_paths
        train_metrics = trainer.evaluate(metric_key_prefix='eval')
        print('*'*20)
        print("Eval results for day %s are:\t" %time_index_eval)
        print('\n' + '*'*20 + '\n')
        for key in sorted(train_metrics.keys()):
            print(" %s = %s" % (key, str(train_metrics[key])))
        wipe_memory()
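A note on the loop above: glob.glob returns an empty list rather than raising when nothing matches, so a missing window directory would only surface later, inside training or evaluation. Below is a minimal sketch of a pre-flight check. The helper name window_paths is hypothetical, and the directory layout is assumed from the snippet; neither is part of Transformers4Rec.

```python
import glob
import os


def window_paths(output_dir, time_index):
    """Return (train_paths, eval_paths) for one sliding-window step.

    Hypothetical helper: trains on `time_index`, evaluates on
    `time_index + 1`, and fails early if either parquet file is missing.
    """
    train_paths = glob.glob(os.path.join(output_dir, f"{time_index}/train.parquet"))
    eval_paths = glob.glob(os.path.join(output_dir, f"{time_index + 1}/valid.parquet"))
    if not train_paths:
        raise FileNotFoundError(f"no train.parquet for window {time_index}")
    if not eval_paths:
        raise FileNotFoundError(f"no valid.parquet for window {time_index + 1}")
    return train_paths, eval_paths
```

Calling window_paths(OUTPUT_DIR, time_index) at the top of each iteration would turn a silently empty path list into an immediate, readable error. It does not check the contents of the files.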

the following error appears:


Launch training for day 1 are:


/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
{'train_runtime': 4.0234, 'train_samples_per_second': 3817.691, 'train_steps_per_second': 14.913, 'train_loss': 10.525657145182292, 'epoch': 60.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:04<00:00, 14.92it/s]
TrainOutput(global_step=60, training_loss=10.525657145182292, metrics={'train_runtime': 4.0234, 'train_samples_per_second': 3817.691, 'train_steps_per_second': 14.913, 'total_flos': 0.0, 'train_loss': 10.525657145182292})
Traceback (most recent call last):
  File "", line 17, in
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2932, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/trainer.py", line 515, in evaluation_loop
    metrics_results_detailed = model.calculate_metrics(preds, labels)
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py", line 616, in calculate_metrics
    head.calculate_metrics(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py", line 453, in calculate_metrics
    task.calculate_metrics(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/prediction_task.py", line 489, in calculate_metrics
    result = metric(predictions, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 301, in forward
    self._forward_cache = self._forward_full_state_update(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 316, in _forward_full_state_update
    self.update(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 465, in wrapped_func
    update(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/ranking_metric.py", line 56, in update
    metric = self._metric(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/ranking_metric.py", line 137, in _metric
    if rel_indices.shape[0] > 0:
IndexError: tuple index out of range
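
The final frame narrows things down. An `IndexError: tuple index out of range` raised by `rel_indices.shape[0]` means that `rel_indices.shape` was an empty tuple, so `rel_indices` was 0-dimensional at that point. How it became 0-dimensional is not established in this thread. The sketch below uses plain tuples as stand-ins for tensor shapes (an assumption, so that it runs without PyTorch) and a hypothetical helper `first_dim` to show the failure and a defensive way to read the first dimension.

```python
def first_dim(shape):
    """Hypothetical helper: size of the first dimension, or None for a 0-d shape."""
    return shape[0] if len(shape) > 0 else None


scalar_shape = ()     # stand-in for the shape of a 0-dimensional tensor
vector_shape = (3,)   # stand-in for the shape of a 1-D tensor with 3 elements

try:
    scalar_shape[0]   # same access pattern as `rel_indices.shape[0]`
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)            # the message seen in the traceback
print(first_dim(scalar_shape))  # None: there is no first dimension to read
print(first_dim(vector_shape))  # 3
```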

Expected behavior

I expected labels to be available for evaluation.

Environment details

  • Transformers4Rec version: 23.12
  • Platform: Docker
  • Python version: 3.10
  • Huggingface Transformers version: 4.27.1
  • PyTorch version (GPU?): 2.1.0a0+4136153
  • Tensorflow version (GPU?):
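
The package versions listed above can also be collected from inside the container. Below is a minimal sketch using the standard library's importlib.metadata. The helper name installed_versions is hypothetical, and the distribution names are assumed to match the PyPI package names.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Distribution names below are assumptions; adjust them to your environment.
    names = ["transformers4rec", "transformers", "torch", "tensorflow"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version}")
```

This reports package versions only. The Docker image tag still has to be read from the `docker run` command or from `docker images`.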

Additional context

hk63560892, Jul 17 '24 01:07

@hk63560892 could you please share the link to the example notebook you are running, and tell us which Docker image you are using?

rnyak, Jul 18 '24 18:07

link: https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/tutorial/03-Session-based-recsys.ipynb

docker: docker run -it --gpus device=0 -p 8000:8000 -p 8001:8001 -p 8002:8002 -p 8888:8888 -v <path_to_data>:/workspace/data/ nvcr.io/nvidia/merlin/merlin-pytorch:23.XX

Thank you!!

hk63560892, Jul 22 '24 04:07

@hk63560892 which Docker image tag are you using? Which 23.XX exactly? We have several tags that start with 23, so please be specific.

Also note that the tutorials have not been maintained for a while, so you can refer to the other example notebooks in the examples directory.

rnyak, Jul 22 '24 17:07