Discussion about using current example dataset to generate cohort query

Open Zrealshadow opened this issue 2 years ago • 0 comments

We want to generate cohort queries from the sogamo dataset for the cohortQueryProcessing unit test. A simple data analysis turned up some problems:

In the sogamo dataset, there are only 4 players in the entire dataset, which contains 10k items. The cohort query in the old-version code is therefore not representative and does not work well as a unit test. According to the CoHANA paper, the raw data is larger than the sample data we currently have. I recommend using the raw data to generate the test cohort query.

The tpch dataset has the same problem. There is only 1 user in the entire dataset, so every order in it belongs to that same user.
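
For reference, the kind of check behind these numbers can be sketched as below. This is a minimal sketch, not the script that was actually run: the column name `playerId` and the inline sample rows are made up for illustration, and the real files may use a different user column or delimiter.

```python
import csv
import io
from collections import Counter


def count_rows_per_user(csv_file, user_column):
    """Return a Counter mapping each user id to its number of rows."""
    reader = csv.DictReader(csv_file)
    return Counter(row[user_column] for row in reader)


# Tiny made-up sample standing in for the real dataset file.
# "playerId" is a hypothetical column name, not taken from the dataset.
sample = io.StringIO(
    "playerId,event,time\n"
    "p1,launch,2013-05-19\n"
    "p1,shop,2013-05-20\n"
    "p2,launch,2013-05-20\n"
)

counts = count_rows_per_user(sample, "playerId")
print(len(counts), "distinct users over", sum(counts.values()), "rows")
```

Pointing the same function at the real file (opened with `open(path, newline="")`) would show how many distinct users a cohort query can actually group on.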

Zrealshadow, Aug 14 '22 08:08