Data Preparation
Hello and thank you for sharing this repository!
In your README, it says:
Prepare your dataset in the expected format:
```
data/
├── train/        # training sequence of user history
├── validation/   # validation sequence of user history
├── test/         # testing sequence of user history
└── items/        # text of all items in the dataset
```
> We provide pre-processed Amazon data explored in the P5 paper [4]. The data can be downloaded from this google drive link.
I don't know if this matters, but in the pre-processed Amazon data that is provided, the sub-directories are actually called `training`, `evaluation`, `testing`, and `items`.
Additionally, while tfrecords are used to store the data, it isn't clear what format the raw/original data needs to be in (i.e., what features, strings, etc.) so that it can be properly converted to a tfrecord and then correctly read/handled/processed by GRID. It would be super helpful if you could provide a simple example of, say, 5 records in a Pandas DataFrame (or a Python dict with the required key/value pairs) and the steps you take to generate the `tfrecord.gz` file. Otherwise, it would be very hard to leverage GRID beyond the (opaque) pre-processed data.
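For reference, here is a minimal sketch of how I would *guess* the conversion works. The feature names (`id`, `text`), the item texts, and the output filename are all my assumptions, not something confirmed by the repository:

```python
import os
import tempfile

import pandas as pd
import tensorflow as tf

# Hypothetical 5-record item dataset; the actual schema GRID expects is
# exactly what this issue is asking about.
df = pd.DataFrame({
    "id": ["0", "1", "2", "3", "4"],
    "text": [
        "Wireless mouse with USB receiver",
        "Stainless steel water bottle, 1L",
        "Noise-cancelling over-ear headphones",
        "Paperback mystery novel",
        "LED desk lamp with dimmer",
    ],
})

def to_example(row: pd.Series) -> tf.train.Example:
    """Wrap one item row into a tf.train.Example with string features."""
    return tf.train.Example(features=tf.train.Features(feature={
        "id": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[row["id"].encode()])),
        "text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[row["text"].encode()])),
    }))

# Write a GZIP-compressed tfrecord file (name mimics the items files).
path = os.path.join(tempfile.mkdtemp(), "data_0.tfrecord.gz")
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as writer:
    for _, row in df.iterrows():
        writer.write(to_example(row).SerializeToString())
```

If this is roughly right, confirming the required feature keys and dtypes would be enough for users to build their own converters.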
Furthermore:
- I noticed that the `tfrecord` files for `items` are called `data_<num>.tfrecord.gz`, while the `training`, `testing`, and `evaluation` files are called `partition_<num>.tfrecord.gz`. Is there a reason for this difference?
- Each `tfrecord` example appears to have an `embedding` feature associated with it, but I'm assuming this isn't needed in our data since we are generating our own embeddings by passing the `text` feature to the `tf-flan-xl` model?
- It would be great to understand which features found in the tfrecord are actually being used/referenced in each of the stages mentioned in the README.
Thanks in advance for your time and consideration!
`preprocessing_functions` filters out the features that are not used and keeps only `id`; during rkmeans training, the embedding is then looked up by item id. This embedding is generated by LLMs.
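A minimal sketch of that flow, with made-up data (the record layout, table shape, and variable names are illustrative assumptions, not GRID's actual code):

```python
import numpy as np

# Toy records: every feature except 'id' will be dropped.
records = [
    {"id": 3, "text": "item A", "embedding": None},
    {"id": 0, "text": "item B", "embedding": None},
    {"id": 4, "text": "item C", "embedding": None},
]

# Placeholder table standing in for LLM-generated embeddings,
# one row per item id (5 items, dimension 8 here).
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((5, 8))

ids = np.array([r["id"] for r in records])  # keep only 'id'
embeddings = embedding_table[ids]           # gather embeddings by item id
# `embeddings` is what the rkmeans stage would then train on.
```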