
[QST] Transformers4Rec/examples/tutorial/03-Session-based-recsys.ipynb

sparta0000 opened this issue 1 year ago · 18 comments

❓ Questions & Help

Details

I have executed notebooks 01 and 02 successfully, and in 03 everything before the block under "3.2.4 Train XLNET with Side Information for Next Item Prediction" also ran without issues.

However, this block is failing:

%%time
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data 
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # Train on day related to time_index 
    print('*'*20)
    print("Launch training for day %s are:" %time_index)
    print('*'*20 + '\n')
    trainer.train_dataset_or_path = train_paths
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step +=1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key]))) 
    wipe_memory()

The error is:

***** Running training *****
  Num examples = 22784
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 267
********************
Launch training for day 1 are:
********************

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<timed exec> in <module>

[/usr/local/lib/python3.9/dist-packages/transformers/trainer.py](https://localhost:8080/#) in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1394 
   1395             step = -1
-> 1396             for step, inputs in enumerate(epoch_iterator):
   1397 
   1398                 # Skip past any already trained steps if resuming training

17 frames
[/usr/local/lib/python3.9/dist-packages/cudf/io/dlpack.py](https://localhost:8080/#) in to_dlpack(cudf_obj)
     90     gdf = gdf.astype(dtype)
     91     arr_cupy = cp.array(df.fillna(-1).to_gpu_matrix())
---> 92 
     93 
     94     return libdlpack.to_dlpack([*gdf._columns])

interop.pyx in cudf._lib.interop.to_dlpack()

ValueError: Cannot create a DLPack tensor with null values. Input is required to have null count as zero.

Please help me check if I am missing anything; I am unable to figure this out.

sparta0000 · Mar 10 '23 11:03

> ValueError: Cannot create a DLPack tensor with null values.

Does your train set or validation set have null values? Can you check, please?
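For reference, a scan like this would surface any nulls in the exported per-day files (a minimal sketch; the OUTPUT_DIR value and the per-day folder layout are assumptions based on the tutorial):

import glob
import os

import cudf

OUTPUT_DIR = "./preproc_sessions_by_day"  # assumption: the tutorial's output dir
for path in sorted(glob.glob(os.path.join(OUTPUT_DIR, "*", "*.parquet"))):
    df = cudf.read_parquet(path)
    null_counts = df.isnull().sum()        # per-column null counts
    bad = null_counts[null_counts > 0]
    if len(bad) > 0:
        print(path)
        print(bad)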

rnyak · Mar 10 '23 16:03

@rnyak I did a cross-check. I have 10 days of data (10 folders), and every folder has train, test, and valid files. I ran df.isnull().any() on each one by one, and every column came back False.

Also, as per the notebook, null treatment is part of the preprocessing, so it should be handled.

What is strange here is that almost identical blocks in

1. Model finetuning and incremental evaluation
2. 3.2.3 Train XLNET for Next Item Prediction

executed successfully.

sparta0000 · Mar 10 '23 17:03

@sparta0000 If you are running the tutorial notebooks with the ecommerce behavior dataset, I cannot repro your issue; everything works fine for me, and I don't get the ValueError: Cannot create a DLPack tensor with null values. issue. Are you running the notebooks with your custom dataset?

Besides, can you please run notebooks 01 and 02 in this folder and see if you get any issue?

rnyak · Mar 12 '23 23:03

@rnyak Yes, I am running the notebooks with my custom dataset. I have specifically checked for null values, though. Is there any other step or check I should take care of while preparing a custom dataset? I have used data in the same format and schema as the tutorial notebook.

Also, while executing the 1st notebook in the folder you mentioned (it uses synthetic data only), I am getting this error in the feature engineering step:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-10-08ff61414734>](https://localhost:8080/#) in <module>
     52 workflow = nvt.Workflow(filtered_sessions['session_id', 'day-first', 'item_id-count'] + seq_feats_list)
     53 
---> 54 dataset = nvt.Dataset(df, cpu=False)
     55 # Generate statistics for the features
     56 workflow.fit(dataset)

1 frames
[/usr/local/lib/python3.9/dist-packages/merlin/core/dispatch.py](https://localhost:8080/#) in convert_data(x, cpu, to_collection, npartitions)
    601                 _x = cudf.DataFrame.from_arrow(x)
    602             elif isinstance(x, pd.DataFrame):
--> 603                 _x = cudf.DataFrame.from_pandas(x)
    604             # Output a collection if `to_collection=True`
    605             return (

AttributeError: 'NoneType' object has no attribute 'DataFrame'

sparta0000 · Mar 13 '23 03:03

I'm also facing the same error now. Last Friday it was running properly, and now it is giving an error even after I installed the latest version. @rnyak can you look into this?

alan-ai-learner · Mar 13 '23 08:03

Install this if you are using a notebook:

 !pip install cudf-cu11==22.12 rmm-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com
 !pip install cugraph-cu11==22.12 dask-cuda==22.12 dask-cudf-cu11==22.12  pylibcugraph-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com/
 !pip install cuml-cu11==22.12 raft_dask_cu11==22.12 dask-cudf-cu11==22.12  pylibraft_cu11==22.12 ucx-py-cu11==0.29.0 --extra-index-url=https://pypi.ngc.nvidia.com

The error goes away for me.
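After those installs, a quick sanity check can confirm the GPU stack imports cleanly (a minimal sketch; the expected versions simply mirror the pins above):

import cudf
import dask_cudf

# Versions should match the 22.12 pins from the install commands above.
print(cudf.__version__, dask_cudf.__version__)
# A tiny round trip fails fast if cudf and the CUDA runtime disagree.
print(cudf.DataFrame({"x": [1, 2, 3]}).sum())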

alan-ai-learner · Mar 13 '23 10:03

@alan-ai-learner Thanks, with this I am able to execute notebook 01-ETL-with-NVTabular, which is based on synthetic data, but my actual issue is still this one:

> ValueError: Cannot create a DLPack tensor with null values. Input is required to have null count as zero.

If you have any solution, let me know.

sparta0000 · Mar 13 '23 11:03

> Install this if you are using a notebook:
>
> !pip install cudf-cu11==22.12 rmm-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com
> !pip install cugraph-cu11==22.12 dask-cuda==22.12 dask-cudf-cu11==22.12 pylibcugraph-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com/
> !pip install cuml-cu11==22.12 raft_dask_cu11==22.12 dask-cudf-cu11==22.12 pylibraft_cu11==22.12 ucx-py-cu11==0.29.0 --extra-index-url=https://pypi.ngc.nvidia.com
>
> The error goes away for me.

Where are you getting this error? From our notebooks or from your custom dataset? It looks like you found a solution. I cannot repro this issue since I am using Merlin docker images, but I believe you are installing the Merlin libraries with pip? If so, yes, you need to install cudf and dask_cudf properly first.

rnyak · Mar 13 '23 14:03

> @alan-ai-learner Thanks, with this I am able to execute notebook 01-ETL-with-NVTabular, which is based on synthetic data, but my actual issue is still this one:
>
> ValueError: Cannot create a DLPack tensor with null values. Input is required to have null count as zero.
>
> If you have any solution, let me know.

@sparta0000

  • Can you please tell us which operator is giving you this error? You can take operators out one by one from the final features that go into nvt.Workflow(..) and see which operator the error comes from (see the sketch below).
  • Are you getting this error from the train set or from the validation set, i.e. from workflow.fit() or from workflow.transform()?
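A minimal sketch of that elimination loop (the dataset path, column names, and ops here are placeholders, not the tutorial's exact features):

import nvtabular as nvt
from nvtabular import ops

dataset = nvt.Dataset("1/train.parquet")  # placeholder path to one day of raw data
candidates = {
    "item_id": ["item_id"] >> ops.Categorify(),
    "price": ["price"] >> ops.Normalize(),
}
for name, feature in candidates.items():
    try:
        workflow = nvt.Workflow(feature)
        workflow.fit(dataset)                  # an error here points at fit
        workflow.transform(dataset).compute()  # an error here points at transform
        print(name, "OK")
    except Exception as exc:
        print(name, "FAILED:", exc)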

rnyak · Mar 13 '23 14:03

I was getting this error when I was training t4rec on custom data. Until last Friday cudf was not mandatory to install, but now it is.

> Install this if you are using a notebook:
>
> !pip install cudf-cu11==22.12 rmm-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com
> !pip install cugraph-cu11==22.12 dask-cuda==22.12 dask-cudf-cu11==22.12 pylibcugraph-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com/
> !pip install cuml-cu11==22.12 raft_dask_cu11==22.12 dask-cudf-cu11==22.12 pylibraft_cu11==22.12 ucx-py-cu11==0.29.0 --extra-index-url=https://pypi.ngc.nvidia.com
>
> The error goes away for me.
>
> Where are you getting this error? From our notebooks or from your custom dataset? It looks like you found a solution. I cannot repro this issue since I am using Merlin docker images, but I believe you are installing the Merlin libraries with pip? If so, yes, you need to install cudf and dask_cudf properly first.

alan-ai-learner · Mar 13 '23 14:03

> !pip install cudf-cu11==22.12 rmm-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com
> !pip install cugraph-cu11==22.12 dask-cuda==22.12 dask-cudf-cu11==22.12 pylibcugraph-cu11==22.12 --extra-index-url=https://pypi.ngc.nvidia.com/
> !pip install cuml-cu11==22.12 raft_dask_cu11==22.12 dask-cudf-cu11==22.12 pylibraft_cu11==22.12 ucx-py-cu11==0.29.0 --extra-index-url=https://pypi.ngc.nvidia.com

The error is in "3.2.4 Train XLNET with Side Information for Next Item Prediction". I am able to execute this block after commenting out the trainer calls.

So the final code looks like this:

%%time
start_time_window_index = 5
final_time_window_index = 7
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data 
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # Train on day related to time_index 
    print('*'*20)
    print("Launch training for day %s are:" %time_index)
    print('*'*20 + '\n')
    trainer.train_dataset_or_path = train_paths
    # trainer.reset_lr_scheduler()                               # --------- line 1
    # trainer.train()                                            # --------- line 2
    trainer.state.global_step += 1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    # train_metrics = trainer.evaluate(metric_key_prefix='eval') # --------- line 3
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key]))) 
    wipe_memory()

Executed output:

********************
Launch training for day 5 are:
********************

********************
Eval results for day 6 are:	

********************

 eval_/loss = 10.597350120544434
 eval_/next-item/ndcg_at_10 = 0.015089108608663082
 eval_/next-item/ndcg_at_20 = 0.01897633634507656
 eval_/next-item/recall_at_10 = 0.02821546420454979
 eval_/next-item/recall_at_20 = 0.04360571876168251
 eval_runtime = 2.139
 eval_samples_per_second = 1286.587
 eval_steps_per_second = 40.206
********************
Launch training for day 6 are:
********************

********************
Eval results for day 7 are:	

********************

 eval_/loss = 10.597350120544434
 eval_/next-item/ndcg_at_10 = 0.015089108608663082
 eval_/next-item/ndcg_at_20 = 0.01897633634507656
 eval_/next-item/recall_at_10 = 0.02821546420454979
 eval_/next-item/recall_at_20 = 0.04360571876168251
 eval_runtime = 2.139
 eval_samples_per_second = 1286.587
 eval_steps_per_second = 40.206
CPU times: user 621 ms, sys: 7.64 ms, total: 629 ms
Wall time: 666 ms

sparta0000 · Mar 13 '23 17:03

@sparta0000 You should not comment out these two lines; if you do, the model won't really train and learn from the data:

trainer.reset_lr_scheduler()
trainer.train()

If you can share a small sample of your raw train dataset (the one that gives the DLPack error) together with your NVT workflow script, I can check what's going on; otherwise it is hard to reproduce your error. Your train dataset might have some NaNs, nulls, or Nones that are creating the issue.
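One defensive pattern along those lines (a hedged sketch, assuming an NVTabular workflow like the tutorial's; the column names are placeholders, and this is not necessarily the fix for this case) is to fill missing values explicitly before any list aggregation:

import nvtabular as nvt
from nvtabular import ops

# Fill nulls up front so none survive into the exported parquet files.
cont_feats = ["price"] >> ops.FillMissing(fill_val=0) >> ops.Normalize()
cat_feats = ["category_code"] >> ops.FillMissing(fill_val="unknown") >> ops.Categorify()
workflow = nvt.Workflow(cont_feats + cat_feats)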

rnyak · Mar 14 '23 15:03

@rnyak Here is the raw data: raw_data.csv

I am getting the issue while performing "Train XLNET with Side Information for Next Item Prediction".

Example notebook link: https://github.com/sparta0000/Traformer4rec_example/blob/main/end_to_end_flow.ipynb

sparta0000 · Mar 15 '23 08:03

@rnyak Please let me know if there is any update. Also, I had one question: do you have an example where we could use this approach to recommend "frequently bought together" items for every product?

sparta0000 · Mar 17 '23 05:03

@sparta0000 do you still have the issue?

rnyak · Jun 23 '23 14:06

TypeError: torch.Size() takes an iterable of 'int' (item 1 is 'NoneType')

I am getting this issue when I apply XLNet to the MovieLens dataset, and my schema is schema = schema.select_by_name(['userId', 'movieId', 'genres']). Please let me know how to resolve it.

NamartaVij · Jun 30 '23 20:06

@NamartaVij We need more info to help you. What do your NVT and model scripts look like? Can you please share them here? We need to see whether you are doing the tagging properly. Please note that you should have sequential (list) columns, and you should tag the item column as ITEMID.
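A minimal sketch of that setup with the column names above (hedged: the exact ops and aggregations depend on your data; this only illustrates the item-id tagging and the sequential list columns):

import nvtabular as nvt
from nvtabular import ops

# movieId becomes the item id (tagged so the model can find it), and the
# per-user interactions are aggregated into sequential list columns.
# genres would be handled the same way if it is one value per row.
item_id = ["movieId"] >> ops.Categorify() >> ops.TagAsItemID()
features = item_id + ["userId"]
sessions = features >> ops.Groupby(
    groupby_cols=["userId"],
    aggs={"movieId": ["list", "count"]},
    name_sep="-",
)
workflow = nvt.Workflow(sessions)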

rnyak · Jul 02 '23 15:07

> @sparta0000 do you still have the issue?

@rnyak Thanks for replying. Actually, I was evaluating different algorithms at that time, and since this didn't work, I settled on RecVAE for my work, as it was performing well. I will try it next time and reach out if I need help. Thank you very much.

csingh03 · Jul 04 '23 08:07