recsys_slates_dataset
Questions regarding slates and data attributes
Hi!
I have two questions regarding the meaning and content of some attributes of the dataset. It would be of great help if you could sort them out for me. From my understanding:
- The array slate_lengths contains the number of items shown to the user.
- The array click_idx contains the position in the slate that the user clicked.
- All slates contain 25 items; when fewer items were shown to the user, the slate is padded with the id 0 at the end.
My questions are about the data in the first record of the first user. There, I find: (user_id=148, item_id=474955, click_idx=0, time_step=0, slate_lengths=26).
- Does this mean the user clicked on a non-shown item (due to click_idx=0) or that the clicked item is outside the 25-item slate?
- What does slate_lengths=26 mean? As all slates contain 25 items (or less), why does that attribute say 26?
Thanks in advance! Have a nice day :-D
Hi, I'm answering from memory, and I'll take your observed data point as given.
- slate_lengths=26 is peculiar indeed. I don't really have a good answer here. Everything should be capped at 25.
- The click_idx is really just a helper field. The true information can be found in click and slate. The click_idx just indicates the position of the clicked item in the slate vector. So look at the click field instead, and I think you will see a noClick (as all the noClicks should be added before the "real" items).
Make sense?
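To illustrate the relationship described above, here is a minimal sketch of how a click position could be recomputed from click and slate. The helper name position_in_slate, the made-up slate, and the -1 return value for a missing item are assumptions for illustration only; they are not part of the dataset or its loader.

```python
import numpy as np


def position_in_slate(slate, click):
    """Return the first index of `click` in `slate`, or -1 if it is absent.

    The -1 convention is an assumption of this sketch, not a dataset rule.
    """
    matches = np.flatnonzero(np.asarray(slate) == click)
    return int(matches[0]) if matches.size else -1


# Hypothetical, shortened slate used only to demonstrate the lookup.
slate = np.array([1, 476178, 909795, 926004, 479095])

print(position_in_slate(slate, 909795))  # 2
print(position_in_slate(slate, 1))       # 0
print(position_in_slate(slate, 474955))  # -1 (item not in this slate)
```

If the recomputed position disagrees with the stored click_idx for a record, that record is worth inspecting by hand.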
Thanks for replying soon! 😊
> slate_lengths=26 is peculiar indeed. I don't really have a good answer here. Everything should be capped at 25.
Oh! I just re-read the documentation, and it says everything should be capped at 20, not 25. Is it possible that the documentation is outdated? Loading the data gives me an array of shape (2277645, 20, 25). My understanding is that 2277645 is the number of users, 20 is the number of time steps, and 25 is the slate length. Is that correct?
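The axis reading above can be sketched on a small synthetic array. This does not load the real file: the sizes, the random ids, and the padded tail are made up, and the (users, time steps, slate positions) layout is the poster's interpretation rather than something confirmed here.

```python
import numpy as np

# Tiny synthetic stand-in for the real array, reported as (2277645, 20, 25).
# Assumed axis order: (users, time steps, slate positions).
n_users, n_steps, slate_size = 3, 20, 25
rng = np.random.default_rng(0)
slate = rng.integers(low=3, high=1000, size=(n_users, n_steps, slate_size))

# Pad the tail of one slate with id 0, as described for shorter slates.
slate[0, 0, 10:] = 0

first_slate = slate[0, 0]              # first user, first time step
shown = int((first_slate != 0).sum())  # entries that are not padding

print(slate.shape)        # (3, 20, 25)
print(first_slate.shape)  # (25,)
print(shown)              # 10
```

On the real data, comparing a count like shown against the stored slate_lengths value would show whether the two agree for a given record.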
> The click_idx is really just a helper field. The true information can be found in click and slate. The click_idx just indicates the position of the clicked item in the slate vector. So look at the click field instead, and I think you will see a noClick (as all the noClicks should be added before the "real" items).
OK, I think I got it: basically, I need to find the item id (in click) within the slate (in slate). In the example, the data is:
- user_id=148,
- click=474955,
- click_idx=0,
- time_step=0,
- slate_lengths=26,
- slate=[ 1, 476178, 2, 909795, 926004, 479095, 486912, 925038,
486758, 920889, 910049, 796287, 909642, 2, 901068, 2,
903854, 474347, 902583, 905483, 671134, 2, 821977, 2,
2
]
In this case, 474955 is not in the slate. Is it possible that users clicked on a non-slate item, or on an item positioned beyond the 25th slot of the slate and hence not included in the dataset? My calculations show there are 135384 cases like this.
I wrote a small reproducible script that you can check at https://gist.github.com/fernandobperezm/5aa67f63c728de704d5ca3082f10d82f. In there I assume the dataset is located in ./data/FINN-no-slates and the file with the dataset is data32.npz.
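For readers who do not want to open the gist, the membership check described above could be sketched roughly as follows. This is not the gist's code: the function name clicks_missing_from_slate is hypothetical, the second time step and its click are invented for contrast, and only the first click and the slate values are taken from the example in this thread.

```python
import numpy as np


def clicks_missing_from_slate(slate, click):
    """Boolean mask that is True where the clicked id is absent from its slate.

    slate: integer array of shape (..., slate_size)
    click: integer array of shape (...)
    """
    slate = np.asarray(slate)
    click = np.asarray(click)
    # Compare each click against every position of its own slate.
    in_slate = (slate == click[..., None]).any(axis=-1)
    return ~in_slate


# Slate quoted in this thread (user_id=148, time_step=0).
example_slate = np.array([
    1, 476178, 2, 909795, 926004, 479095, 486912, 925038,
    486758, 920889, 910049, 796287, 909642, 2, 901068, 2,
    903854, 474347, 902583, 905483, 671134, 2, 821977, 2,
    2,
])

# Shape (1 user, 2 time steps, 25); the second step reuses the slate and
# pairs it with an invented click that is present in the slate.
slate = np.stack([example_slate, example_slate])[None, ...]
click = np.array([[474955, 476178]])

missing = clicks_missing_from_slate(slate, click)
print(missing)             # [[ True False]]
print(int(missing.sum()))  # 1
```

Applied to the full slate and click arrays, summing the mask would give a count comparable to the one reported above, provided the same records are included.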