
Data Preparation

Open seanlaw opened this issue 8 months ago • 2 comments

Hello and thank you for sharing this repository!

In your README, it says:

Prepare your dataset in the expected format:

data/
├── train/        # training sequence of user history
├── validation/   # validation sequence of user history
├── test/         # testing sequence of user history
└── items/        # text of all items in the dataset

We provide pre-processed Amazon data explored in the P5 paper [4]. The data can be downloaded from this google drive link.

I don't know if this matters but in the pre-processed Amazon data that is provided, the sub-directories are actually called training, evaluation, testing, and items.

Additionally, while tfrecord files are used to store the data, it isn't clear what format the raw/original data needs to be in so that it can be properly converted to a tfrecord (i.e., which features, strings, etc.) and then correctly read/handled/processed by GRID. It would be super helpful if you could provide a simple example of, say, 5 records in a Pandas DataFrame (or a Python dict with the required key/value pairs) and the steps that you take to generate the tfrecord.gz file. Otherwise, it would be very hard to leverage GRID beyond the (opaque) pre-processed data.
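For reference, a minimal sketch of the kind of example being requested might look like the following. The feature names `id` and `text`, the five sample records, and the helper names `feature_kind` and `write_tfrecord_gz` are all assumptions made for illustration; they are not the schema or the conversion code that GRID actually uses. The sketch only shows the generic TensorFlow steps for turning Python dicts into a gzip-compressed tfrecord file.

```python
from typing import Any, Dict, List

# Hypothetical item records: the keys "id" and "text" are assumptions for
# illustration, not the schema that GRID actually requires.
RECORDS: List[Dict[str, Any]] = [
    {"id": 0, "text": "Wireless mouse with USB receiver"},
    {"id": 1, "text": "Stainless steel water bottle"},
    {"id": 2, "text": "Noise cancelling headphones"},
    {"id": 3, "text": "Paperback mystery novel"},
    {"id": 4, "text": "Yoga mat with carrying strap"},
]


def feature_kind(value: Any) -> str:
    """Name the tf.train.Feature list type that a Python value maps to."""
    if isinstance(value, (list, tuple)):
        if not value:
            raise ValueError("cannot infer a feature type from an empty list")
        return feature_kind(value[0])
    if isinstance(value, (str, bytes)):
        return "bytes"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def write_tfrecord_gz(records: List[Dict[str, Any]], path: str) -> None:
    """Serialize each record as a tf.train.Example into a GZIP tfrecord."""
    import tensorflow as tf  # imported lazily; only needed when writing

    def to_feature(value: Any) -> "tf.train.Feature":
        values = list(value) if isinstance(value, (list, tuple)) else [value]
        kind = feature_kind(value)
        if kind == "bytes":
            encoded = [v.encode("utf-8") if isinstance(v, str) else v for v in values]
            return tf.train.Feature(bytes_list=tf.train.BytesList(value=encoded))
        if kind == "int64":
            return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))

    options = tf.io.TFRecordOptions(compression_type="GZIP")
    with tf.io.TFRecordWriter(path, options=options) as writer:
        for record in records:
            features = {key: to_feature(val) for key, val in record.items()}
            example = tf.train.Example(features=tf.train.Features(feature=features))
            writer.write(example.SerializeToString())
```

With TensorFlow installed, `write_tfrecord_gz(RECORDS, "data_0.tfrecord.gz")` would write the five records to one compressed file. Which keys GRID actually reads from such a file is exactly the open question in this issue.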

Furthermore:

  1. I noticed that the tfrecord files for items are called data_<num>.tfrecord.gz while the training, testing, and evaluation files are called partition_<num>.tfrecord.gz. Is there a reason for this difference?
  2. Each tfrecord example appears to have an embedding (feature) associated with it but I'm assuming that this isn't needed in our data since we are generating our own embeddings by passing the text (feature) to the tf-flan-xl model?
  3. It would be great to understand which features found in the tfrecord are actually being used/referenced in each of the stages mentioned in the README.

Thanks in advance for your time and consideration!

seanlaw, Aug 15 '25 15:08

Image

ZhengJiWei007, Sep 11 '25 09:09

preprocessing_functions filters out the features that are not used and keeps only 'id'; the embedding is then looked up by item id. In rkmeans training, this embedding is the one generated by the LLM.
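As a rough sketch of that filter-then-lookup flow (not GRID's actual code): the helper names `keep_only_id` and `lookup_embeddings` and the in-memory `embedding_table` dict are hypothetical, and the real preprocessing_functions may be structured quite differently.

```python
from typing import Any, Dict, List


def keep_only_id(example: Dict[str, Any]) -> Dict[str, Any]:
    """Drop every feature except 'id' (hypothetical stand-in for the filter)."""
    return {"id": example["id"]}


def lookup_embeddings(
    examples: List[Dict[str, Any]],
    embedding_table: Dict[int, List[float]],
) -> List[List[float]]:
    """Fetch the precomputed (LLM-generated) embedding for each item id."""
    return [embedding_table[example["id"]] for example in examples]


# Toy data: the extra 'text' and 'embedding' features are discarded by the
# filter, and embeddings come from the separate table keyed by item id.
raw_examples = [
    {"id": 2, "text": "item two", "embedding": [9.0, 9.0]},
    {"id": 0, "text": "item zero"},
]
embedding_table = {0: [0.1, 0.2], 2: [0.5, 0.6]}

filtered = [keep_only_id(example) for example in raw_examples]
vectors = lookup_embeddings(filtered, embedding_table)
```

Under this reading, any embedding stored inside the tfrecord examples would be ignored in this stage, since only 'id' survives the filter.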

ZhengJiWei007, Sep 11 '25 09:09