Transformers4Rec
[QST] Transformers4Rec/examples/tutorial/03-Session-based-recsys.ipynb
❓ Questions & Help
Details
I have executed notebooks 01 and 02 successfully, and in 03 everything before the block under "3.2.4 Train XLNET with Side Information for Next Item Prediction" also ran with no issues.
However, this block is giving me an error:
%%time
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # Train on day related to time_index
    print('*'*20)
    print("Launch training for day %s are:" %time_index)
    print('*'*20 + '\n')
    trainer.train_dataset_or_path = train_paths
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step += 1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key])))
    wipe_memory()
The error is:
***** Running training *****
Num examples = 22784
Num Epochs = 3
Instantaneous batch size per device = 256
Total train batch size (w. parallel, distributed & accumulation) = 256
Gradient Accumulation steps = 1
Total optimization steps = 267
********************
Launch training for day 1 are:
********************
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<timed exec> in <module>
/usr/local/lib/python3.9/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1394
1395 step = -1
-> 1396 for step, inputs in enumerate(epoch_iterator):
1397
1398 # Skip past any already trained steps if resuming training
17 frames
/usr/local/lib/python3.9/dist-packages/cudf/io/dlpack.py in to_dlpack(cudf_obj)
90 gdf = gdf.astype(dtype)
91 arr_cupy = cp.array(df.fillna(-1).to_gpu_matrix())
---> 92
93
94 return libdlpack.to_dlpack([*gdf._columns])
interop.pyx in cudf._lib.interop.to_dlpack()
ValueError: Cannot create a DLPack tensor with null values. Input is required to have null count as zero.
Please help me if I am missing anything to cross-check; I am unable to figure this out.
ValueError: Cannot create a DLPack tensor with null values.
Does your train set or validation set have null values? Can you check, please?
@rnyak I did a cross-check. I have 10 days of data (10 folders); every folder has train, test and valid. I ran df.isnull().any() on each of them one by one, and every variable is False.
Also, the null treatment is in the notebook's preprocessing, so that is handled.
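For reference, a minimal sketch of the check I ran, assuming the tutorial's OUTPUT_DIR layout with one folder per day. One caveat: isnull() on the outer DataFrame only flags nulls in scalar columns; it does not look inside the sequential (list) columns, so NaNs hidden in the sequences would not show up here.

import os
import pandas as pd

OUTPUT_DIR = "./preproc_sessions_by_day"  # assumption: same output dir as in the tutorial

for day in range(1, 11):  # the 10 day folders
    for split in ("train", "valid", "test"):
        path = os.path.join(OUTPUT_DIR, f"{day}/{split}.parquet")
        if not os.path.exists(path):
            continue
        df = pd.read_parquet(path)
        # isnull().any() catches missing values in scalar columns only;
        # list-valued cells are opaque objects here and are reported as non-null.
        print(day, split, df.isnull().any().any())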
What is strange here is that the almost identical blocks in (1) "Model finetuning and incremental evaluation" and (2) "3.2.3 Train XLNET for Next Item Prediction" executed successfully.
@sparta0000 If you are running the tutorial notebooks with the ecommerce behavior dataset, I cannot repro your issue; all works fine for me, and I don't get the ValueError: Cannot create a DLPack tensor with null values. error. Are you running the notebooks with your custom dataset?
Besides, can you please run notebooks 01 and 02 in this folder and see if you get any issue?
@rnyak Yes, I am running the notebook with my custom dataset. I have specifically checked for null values, though. Is there any other step/activity I should take care of while preparing a custom dataset? I have used data in the same format and schema as the tutorial notebook.
Also, while executing the 1st notebook in the folder you mentioned (it is synthetic data only), I am getting this error in the feature engineering step:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-08ff61414734> in <module>
52 workflow = nvt.Workflow(filtered_sessions['session_id', 'day-first', 'item_id-count'] + seq_feats_list)
53
---> 54 dataset = nvt.Dataset(df, cpu=False)
55 # Generate statistics for the features
56 workflow.fit(dataset)
1 frames
/usr/local/lib/python3.9/dist-packages/merlin/core/dispatch.py in convert_data(x, cpu, to_collection, npartitions)
601 _x = cudf.DataFrame.from_arrow(x)
602 elif isinstance(x, pd.DataFrame):
--> 603 _x = cudf.DataFrame.from_pandas(x)
604 # Output a collection if `to_collection=True`
605 return (
AttributeError: 'NoneType' object has no attribute 'DataFrame'
I'm also facing the same error now. Last Friday it was running properly and now it is giving this error; I installed the latest version. @rnyak can you look into this?
Install this if using a notebook:
!pip install cudf-cu11==22.12 rmm-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com
!pip install cugraph-cu11==22.12 dask-cuda==22.12 dask-cudf-cu11==22.12 pylibcugraph-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com/
!pip install cuml-cu11==22.12 raft_dask_cu11==22.12 dask-cudf-cu11==22.12 pylibraft_cu11==22.12 ucx-py-cu11==0.29.0 --extra-index-url=https://pypi.ngc.nvidia.com
The error goes away for me.
@alan-ai-learner Thanks, with this I am able to execute notebook 01-ETL-with-NVTabular, which is based on synthetic data, but my actual issue is this one:
ValueError: Cannot create a DLPack tensor with null values. Input is required to have null count as zero.
If you have any solution, let me know.
> Install this if using a notebook:
> !pip install cudf-cu11==22.12 ... (the same pip commands as above)
> The error goes away for me.
Where are you getting this error? From our notebooks or from your custom dataset? Looks like you found a solution. I cannot repro this issue since I am using the Merlin docker images, but I believe you are installing the Merlin libs with pip? If so, yes, you need to install cudf and dask_cudf properly first.
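A quick sanity check after the pip installs (just an illustrative snippet, not part of the tutorial) is to confirm the GPU dataframe stack imports cleanly:

# Verify that the RAPIDS pieces NVTabular relies on are importable.
import cudf
import dask_cudf

print("cudf version:", cudf.__version__)

# A trivial GPU round-trip; if this works, cudf itself is healthy.
gdf = cudf.DataFrame({"a": [1, 2, 3]})
print(gdf.to_pandas())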
> @alan-ai-learner Thanks, with this I am able to execute notebook 01-ETL-with-NVTabular, which is based on synthetic data, but my actual issue is this one:
> ValueError: Cannot create a DLPack tensor with null values. Input is required to have null count as zero.
> If you have any solution, let me know.
@sparta0000
- Can you please tell us which operator is giving you this error? You can take operators out one by one from the final features that go to nvt.Workflow(..) and see which operator the error comes from (see the sketch below).
- Are you getting this error from the train set or from the validation set, i.e. from workflow.fit() or from workflow.transform()?
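Something along these lines can help narrow it down; a rough sketch only, where base_features, side_info_features and train_df are placeholders for the operator groups and data you actually feed into the workflow:

import nvtabular as nvt

# Hypothetical operator groups; substitute the real pieces of your workflow
# (e.g. the groupby output, the side-information features, etc.).
candidate_feature_groups = {
    "base_features": base_features,
    "side_information": side_info_features,
}

for name, feats in candidate_feature_groups.items():
    workflow = nvt.Workflow(feats)
    dataset = nvt.Dataset(train_df, cpu=False)
    try:
        workflow.fit(dataset)                        # fails here -> problem during fit / train set
        workflow.transform(dataset).to_ddf().head()  # fails here -> problem during transform
        print(name, "OK")
    except Exception as e:
        print(name, "FAILED:", e)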
I was getting this error when training t4rec on custom data. Until last Friday cudf was not mandatory to install, but now it is.
> Install this if using a notebook:
> !pip install cudf-cu11==22.12 ... (the same pip commands as above)
> The error goes away for me.
> Where are you getting this error? From our notebooks or from your custom dataset? ... you need to install cudf and dask_cudf properly first.
It is in "3.2.4 Train XLNET with Side Information for Next Item Prediction". I am able to execute the block after commenting out the trainer calls.
So the final code looks like this:
%%time
start_time_window_index = 5
final_time_window_index = 7
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # Train on day related to time_index
    print('*'*20)
    print("Launch training for day %s are:" %time_index)
    print('*'*20 + '\n')
    trainer.train_dataset_or_path = train_paths
    # trainer.reset_lr_scheduler()  # --------- line 1
    # trainer.train()               # --------- line 2
    trainer.state.global_step += 1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    # train_metrics = trainer.evaluate(metric_key_prefix='eval')  # --------- line 3
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key])))
    wipe_memory()
Executed output:
********************
Launch training for day 5 are:
********************
********************
Eval results for day 6 are:
********************
eval_/loss = 10.597350120544434
eval_/next-item/ndcg_at_10 = 0.015089108608663082
eval_/next-item/ndcg_at_20 = 0.01897633634507656
eval_/next-item/recall_at_10 = 0.02821546420454979
eval_/next-item/recall_at_20 = 0.04360571876168251
eval_runtime = 2.139
eval_samples_per_second = 1286.587
eval_steps_per_second = 40.206
********************
Launch training for day 6 are:
********************
********************
Eval results for day 7 are:
********************
eval_/loss = 10.597350120544434
eval_/next-item/ndcg_at_10 = 0.015089108608663082
eval_/next-item/ndcg_at_20 = 0.01897633634507656
eval_/next-item/recall_at_10 = 0.02821546420454979
eval_/next-item/recall_at_20 = 0.04360571876168251
eval_runtime = 2.139
eval_samples_per_second = 1286.587
eval_steps_per_second = 40.206
CPU times: user 621 ms, sys: 7.64 ms, total: 629 ms
Wall time: 666 ms
@sparta0000 You should not comment out these two lines; if you do so, the model won't actually train and learn from the data:
trainer.reset_lr_scheduler()
trainer.train()
If you can share a small sample of your raw train dataset (the one that gives the DLPack error) together with your NVT workflow script, I can check what's going on; otherwise it is hard to reproduce your error. Your train dataset might have some NaNs, nulls or Nones that are creating the issue.
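A rough sketch of the kind of pre-check and cleanup meant here, done at the pandas level before the NVT workflow (the file and column names are placeholders for whatever your raw data uses):

import pandas as pd

df = pd.read_csv("raw_data.csv")  # placeholder path to your raw interactions

# Report which columns contain missing values.
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])

# One simple option: drop rows missing the required interaction fields.
# "session_id", "item_id" and "timestamp" are placeholder column names.
required_cols = ["session_id", "item_id", "timestamp"]
df = df.dropna(subset=[c for c in required_cols if c in df.columns])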
@rnyak raw_data.csv
Here is the raw data. I am getting the issue while performing "Train XLNET with Side Information for Next Item Prediction".
Example notebook link: https://github.com/sparta0000/Traformer4rec_example/blob/main/end_to_end_flow.ipynb
@rnyak please let me know if there is any update. Also, I had one question: do you have any example where we could use this approach to recommend "frequently bought together" items against every product?
@sparta0000 do you still have the issue?
TypeError: torch.Size() takes an iterable of 'int' (item 1 is 'NoneType')
I am getting this issue when applying XLNet on the MovieLens dataset, and my schema is schema = schema.select_by_name(['userId', 'movieId', 'genres']). Please let me know how to resolve it.
@NamartaVij We need more info to help you. What do your NVT and model scripts look like? Can you please share them here? We need to see that you are doing the tagging properly. Please note that you should have sequential (list) columns, and you should tag the item column as ITEM_ID.
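A minimal sketch of what that could look like with NVTabular ops, assuming a MovieLens-style frame with userId/movieId/genres grouped into per-user sequences (this is not the exact tutorial workflow; adjust column names and the groupby key to your data):

import nvtabular as nvt
from nvtabular import ops

# Categorify the raw columns and tag their roles for Transformers4Rec.
item_id = ["movieId"] >> ops.Categorify() >> ops.TagAsItemID()
item_feats = ["genres"] >> ops.Categorify() >> ops.TagAsItemFeatures()

# Group interactions per user (or per session) so each feature becomes a list column.
seq_feats = (
    item_id + item_feats + ["userId"]
    >> ops.Groupby(
        groupby_cols=["userId"],
        aggs={"movieId": ["list", "count"], "genres": ["list"]},
        name_sep="-",
    )
)

workflow = nvt.Workflow(seq_feats)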
@sparta0000 do you still have the issue?
@rnyak Thanks for replying. Actually, I was evaluating different algorithms at that time, and since this didn't work I settled on RecVAE for my work, as it was performing well. I will try this next time and reach out if I need help. Thank you very much.