[BUG] A bug while fine-tuning the model by iteratively training and evaluating using a sliding time window
Bug description
I found that valid.parquet does not contain a label column.
Steps/Code to reproduce bug
While I'm running this code:

import glob
import os

# `trainer`, `OUTPUT_DIR` and `wipe_memory` are defined earlier in the notebook.
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))

    # Train on day related to time_index
    print('*' * 20)
    print("Launch training for day %s are:" % time_index)
    print('*' * 20 + '\n')
    trainer.train_dataset_or_path = train_paths
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step += 1

    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*' * 20)
    print("Eval results for day %s are:\t" % time_index_eval)
    print('\n' + '*' * 20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key])))
    wipe_memory()

the following error appears:
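One thing that may be worth ruling out first: `glob.glob` returns an empty list when nothing matches, so a missing `valid.parquet` for a given window would be handed to the trainer without any warning. Below is a minimal sketch, assuming the `<OUTPUT_DIR>/<time_index>/train.parquet` and `<OUTPUT_DIR>/<time_index + 1>/valid.parquet` layout used in the loop; `window_pairs` is a hypothetical helper, not part of Transformers4Rec.

```python
import glob
import os


def window_pairs(output_dir, start, stop):
    """Yield (time_index, train_paths, eval_paths) for each sliding window.

    Hypothetical helper: raises FileNotFoundError as soon as either glob
    comes back empty, instead of passing an empty list to the trainer.
    """
    for time_index in range(start, stop):
        train_paths = glob.glob(
            os.path.join(output_dir, f"{time_index}/train.parquet")
        )
        eval_paths = glob.glob(
            os.path.join(output_dir, f"{time_index + 1}/valid.parquet")
        )
        if not train_paths:
            raise FileNotFoundError(f"no train.parquet for window {time_index}")
        if not eval_paths:
            raise FileNotFoundError(f"no valid.parquet for window {time_index + 1}")
        yield time_index, train_paths, eval_paths
```

Iterating with `for time_index, train_paths, eval_paths in window_pairs(OUTPUT_DIR, 1, 4):` would then replace the two `glob.glob` calls in the loop body.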
Launch training for day 1 are:
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
{'train_runtime': 4.0234, 'train_samples_per_second': 3817.691, 'train_steps_per_second': 14.913, 'train_loss': 10.525657145182292, 'epoch': 60.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:04<00:00, 14.92it/s]
TrainOutput(global_step=60, training_loss=10.525657145182292, metrics={'train_runtime': 4.0234, 'train_samples_per_second': 3817.691, 'train_steps_per_second': 14.913, 'total_flos': 0.0, 'train_loss': 10.525657145182292})
Traceback (most recent call last):
File "
Expected behavior
I expected valid.parquet to contain labels for evaluation.
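To confirm what the evaluation file actually contains, the column names can be read from the Parquet schema without loading any rows. This is a minimal sketch, assuming `pyarrow` is installed in the container; `missing_columns` and `parquet_columns` are hypothetical helper names, and the required column names have to come from your own model schema.

```python
def missing_columns(available, required):
    """Return the names in `required` that are absent from `available`."""
    available = set(available)
    return [name for name in required if name not in available]


def parquet_columns(path):
    """Read only the schema of a Parquet file and return its column names."""
    # Imported lazily so that missing_columns works without pyarrow.
    import pyarrow.parquet as pq

    return pq.read_schema(path).names
```

For example, `missing_columns(parquet_columns(eval_paths[0]), required)` with `required` set to the list of columns the model expects would show which of them are absent from `valid.parquet`.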
Environment details
- Transformers4Rec version: 23.12
- Platform: Docker
- Python version: 3.10
- Huggingface Transformers version: 4.27.1
- PyTorch version (GPU?): 2.1.0a0+4136153
- Tensorflow version (GPU?):
Additional context
@hk63560892 could you please share the link to the example notebook you are running, and tell us which docker image you are using?
link: https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/tutorial/03-Session-based-recsys.ipynb
docker: docker run -it --gpus device=0 -p 8000:8000 -p 8001:8001 -p 8002:8002 -p 8888:8888 -v <path_to_data>:/workspace/data/ nvcr.io/nvidia/merlin/merlin-pytorch:23.XX
Thank you!!
@hk63560892 which docker image tag are you using? Which 23.XX exactly? We have several tags that start with 23, so please be specific.
Also note that the tutorials have not been maintained for a while, so you may want to refer to the other example notebooks in the examples directory instead.