Ads-RecSys-Datasets

iPinYou data size and baseline code

Open Sandy4321 opened this issue 6 years ago • 5 comments

The zipped iPinYou data is 249 MB, and 1.5 GB unzipped, at https://drive.google.com/drive/folders/1thXezQbmuS6Q8-AXmrhB0tLM3mybJxVR?usp=sharing

but https://github.com/wnzhang/make-ipinyou-data states:

After the program finished, the total size of the folder will be 14G.

Is the data at https://drive.google.com/drive/folders/1thXezQbmuS6Q8-AXmrhB0tLM3mybJxVR?usp=sharing so much smaller because it is stored as HDF5, or is some clarification needed?

I understand that removing the user-tag feature (because of the leakage problem) also reduces the file size somewhat.

Could you please share some baseline code for trying this data? Then everything will be clear.

Sandy4321, Jan 27 '20 16:01

HDF5 is a compressed file format, so you should check the number of examples instead of the file size. I have shared all the baselines compared in my papers; see https://github.com/Atomu2014/product-nets and https://github.com/Atomu2014/product-nets-distributed
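
The check suggested above (count examples rather than compare file sizes) could be sketched as follows. This is not code from the thread or the repository: it assumes the third-party `h5py` package is installed, the function names `dataset_shapes` and `summarize` are hypothetical, and the 8-bytes-per-value estimate is only a rough guess, since the actual dtypes and dataset names in the shared files are not stated here.

```python
from math import prod


def dataset_shapes(path):
    """Return {dataset_name: shape} for every dataset in an HDF5 file."""
    # Assumption: h5py is installed. It is imported lazily so that
    # summarize() below can be used without it.
    import h5py

    shapes = {}

    def record(name, obj):
        # visititems walks groups and datasets; keep only the datasets.
        if isinstance(obj, h5py.Dataset):
            shapes[name] = obj.shape

    with h5py.File(path, "r") as f:
        f.visititems(record)
    return shapes


def summarize(shapes, itemsize=8):
    """Map each dataset to (rows, estimated uncompressed bytes).

    itemsize is a hypothetical bytes-per-value figure, not read from the file.
    """
    return {
        name: (shape[0] if shape else 1, prod(shape) * itemsize)
        for name, shape in shapes.items()
    }
```

Calling `summarize(dataset_shapes("some_file.hdf"))` on a downloaded file (the file name is a placeholder) would give row counts to compare against the number of examples reported for the dataset, which is a more reliable check than the size on disk.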

Atomu2014, Jan 27 '20 19:01

Great, thanks a lot. But I am looking for a really simple Python baseline without complicated packages such as TF. Do you have one, or do you know somebody who has? Performance is not important; I am just trying to learn from the very beginning.

Sandy4321, Jan 27 '20 20:01

Hi, I suggest you try these packages: xgboost > libfm > libffm. Search for them on the Internet and find the official guides. These packages are easy to try since you don't need to touch the model; the only thing you should do is prepare the data and call the API / CLI.
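
As a rough illustration of the "prepare the data" step, here is a minimal standard-library sketch that turns one example's categorical values into a libsvm-style text line (the sparse `label index:value` format that xgboost and libfm can read) and a libffm-style line (`label field:index:value`). The hashing trick, the bucket count, and the function names are assumptions made for this example, not something the thread or the repository prescribes.

```python
import zlib


def hash_index(field, value, num_buckets):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    # The +1 keeps indices 1-based, the usual libsvm convention.
    return zlib.crc32(f"{field}={value}".encode("utf-8")) % num_buckets + 1


def to_libsvm_line(label, cats, num_buckets=2**20):
    """Build 'label idx:1 idx:1 ...' with ascending, de-duplicated indices."""
    indices = sorted({hash_index(f, v, num_buckets) for f, v in enumerate(cats)})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


def to_libffm_line(label, cats, num_buckets=2**20):
    """Build 'label field:idx:1 ...'; each feature also carries its field id."""
    parts = [f"{f}:{hash_index(f, v, num_buckets)}:1" for f, v in enumerate(cats)]
    return " ".join([str(label)] + parts)
```

Writing one such line per example to a text file gives an input that can be handed to the tools' own loaders or command-line programs; the exact flags are described in each package's official guide.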

Atomu2014, Jan 27 '20 21:01

Great, so where can I get the preprocessed Criteo dataset? Per the README: "The original dataset is known as Criteo 1TB click log, in which the CriteoLab has collected 30 days of masked data. We only know there are 13 numerical and 26 categorical features, and there is no feature description released. Thus we name these features as num_0 ... num_12, and cat_0 ..., cat_25."
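
The naming scheme in that quote can be written out as a short sketch. The column names follow the README text quoted above; the assumption that a raw log line is tab-separated with the label first, and the name `parse_line`, are illustrative guesses rather than something stated in this thread.

```python
# Feature names as described in the quoted README text.
NUM_COLS = [f"num_{i}" for i in range(13)]   # num_0 ... num_12
CAT_COLS = [f"cat_{i}" for i in range(26)]   # cat_0 ... cat_25
COLUMNS = ["label"] + NUM_COLS + CAT_COLS


def parse_line(line):
    """Split one raw line into a {column: string value} dict.

    Assumption: fields are tab-separated and the label comes first.
    Empty strings are kept as-is so missing values stay visible.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))
```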

Sandy4321, Mar 22 '20 21:03

Hi, there are 2 download links in the "Download" section of README. The processed dataset only contains 8 days' logs.

Atomu2014, Mar 22 '20 21:03