COOL
Discussion about using current example dataset to generate cohort query
We want to generate a cohort query from the sogamo dataset for the cohortQueryProcessing unit test.
Some simple data analysis turned up the following problems:
The sogamo dataset contains 10k items but only 4 players. The cohort query in the old-version code is therefore not representative and does not work well as a unit test. According to the CoHANA paper, the raw data is larger than the sample data we currently have. I recommend using the raw data to generate the test cohort query.
The tpch dataset has the same problem: there is only 1 user in the entire dataset, so every order in it belongs to that one user.
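For anyone who wants to repeat this check on the raw data, the distinct-user count can be reproduced with something like the sketch below. It is only an illustration: it assumes the dataset is a CSV file with a header row, and the column name `playerId` and the sample rows are made up for the example, not taken from the actual sogamo or tpch files.

```python
import csv
import io
from collections import Counter


def count_rows_per_user(csv_file, user_column):
    """Return a Counter mapping each user ID to the number of rows it owns."""
    reader = csv.DictReader(csv_file)
    return Counter(row[user_column] for row in reader)


# Made-up sample rows, for illustration only (not real dataset content).
sample = io.StringIO(
    "playerId,event,time\n"
    "p1,launch,2013-05-19\n"
    "p1,shop,2013-05-20\n"
    "p2,launch,2013-05-20\n"
    "p1,fight,2013-05-21\n"
)

counts = count_rows_per_user(sample, "playerId")
print(len(counts), "distinct users over", sum(counts.values()), "rows")
```

If the number of distinct users is tiny compared with the number of rows, as with the two sample datasets here, any cohort built from the data will contain only a handful of users.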