Enhancement: Generate standard data sets
It would be useful to be able to generate standard data sets without having to define columns etc., for quick demos and benchmarking of different activities.
The goal would be to make it very easy to quickly generate a data set for benchmarking and other purposes without having to invest much time in learning the details of the data generation framework.
These could be modelled on standard public data sets such as those published as part of Kaggle challenges, e.g. standard data sets for customers, purchases, and sales.
In particular, for exploring CDC scenarios it would be useful to be able to generate complementary standard data sets for both baseline data and incremental data.
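To make the CDC use case concrete, here is a minimal plain-Python sketch of what complementary baseline and incremental data sets could mean. This is not the dbldatagen API; every function and field name here is hypothetical, purely to illustrate the relationship between the two sets:

```python
import random

def make_baseline(n, seed=42):
    """Generate a baseline set of synthetic customer records keyed by id.
    (Hypothetical helper, not part of dbldatagen.)"""
    rng = random.Random(seed)
    return {i: {"id": i, "balance": rng.randint(0, 1000)} for i in range(n)}

def make_increment(baseline, n_updates, n_inserts, seed=43):
    """Generate a complementary incremental change set: updates against
    existing baseline keys plus inserts with brand-new keys."""
    rng = random.Random(seed)
    changed = rng.sample(sorted(baseline), n_updates)
    changes = [{"id": i, "balance": rng.randint(0, 1000), "op": "update"}
               for i in changed]
    next_id = max(baseline) + 1
    changes += [{"id": next_id + j, "balance": rng.randint(0, 1000), "op": "insert"}
                for j in range(n_inserts)]
    return changes

baseline = make_baseline(100)
increment = make_increment(baseline, n_updates=10, n_inserts=5)
```

The key property a standard CDC pair would need is shown in the sketch: the incremental set references keys that exist in the baseline (updates) alongside keys that do not (inserts).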
## Proposed Behavior

```python
import dbldatagen as dg

# define a standard data set for customers
testdata_generator = (
    dg.DataGenerator(spark, name="test_dataset", rows=100000, partitions=20)
    .usingStandardDataset("customers")
)

df = testdata_generator.build()  # build our dataset
```
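One possible shape for the lookup behind a `usingStandardDataset()`-style call is a simple registry mapping standard data set names to column specifications. The sketch below is purely illustrative; the registry, its contents, and the function name are all assumptions, not existing dbldatagen code:

```python
# Hypothetical registry of standard data sets; the names and column
# specs here are illustrative only, not part of dbldatagen.
STANDARD_DATASETS = {
    "customers": [
        ("customer_id", "long"),
        ("name", "string"),
        ("email", "string"),
    ],
    "purchases": [
        ("purchase_id", "long"),
        ("customer_id", "long"),
        ("amount", "decimal(10,2)"),
    ],
}

def using_standard_dataset(name):
    """Resolve a standard data set name to its column definitions,
    raising a clear error for unknown names."""
    if name not in STANDARD_DATASETS:
        raise ValueError(f"unknown standard data set: {name!r}")
    return STANDARD_DATASETS[name]
```

A registry like this would let the generator pre-populate column definitions by name, which is the essence of the proposed convenience API.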
@ronanstokes-db Hi Ronan,
Regarding standard datasets for benchmarking and PoC purposes, I have recently built a new package for PySpark which allows easy, instant access to over 750 standard datasets, with search capabilities. Check it out here:
Let me know if it meets the requirements you are looking for, and then we can integrate my library into dbldatagen 😃
Hi Souvik
Are these all sourced from open source projects with open source licenses?
Hi @ronanstokes-db
These datasets were distributed as free to use alongside the statistical software environment R and some of its add-on packages. I am porting the same datasets to be usable from Spark.
However, for each dataset I have also included the original owner and reference details:
For example,
Input

```python
from sparkdataset import data

# Display the documentation for a dataset
data('titanic', show_doc=True)
```
Output
titanic
SparkDataset Documentation (adopted from R Documentation. The displayed examples are in R)
## titanic
### Description
The data is an observation-based version of the 1912 Titanic passenger survival log.
### Usage
data(titanic)
### Format
A data frame with 1316 observations on the following 4 variables.
- `class`: a factor with levels `1st class`, `2nd class`, `3rd class`, `crew`
- `age`: a factor with levels `child`, `adults`
- `sex`: a factor with levels `women`, `man`
- `survived`: a factor with levels `no`, `yes`
### Details
`titanic` is saved as a data frame. It is used to assess risk ratios.
### Source
Found in many other texts
### References
- Hilbe, Joseph M (2014), Modeling Count Data, Cambridge University Press
- Hilbe, Joseph M (2007, 2011), Negative Binomial Regression, Cambridge University Press
- Hilbe, Joseph M (2009), Logistic Regression Models, Chapman & Hall/CRC
### Examples
```r
data(titanic)
titanic$survival <- titanic$survived == "yes"
glmlr <- glm(survival ~ age + sex + factor(class), family=binomial, data=titanic)
summary(glmlr)
```
Coming back to this: if there is already a published package, it does not seem to make sense to simply include it as part of the data generator unless we add some value to it, but happy to discuss offline.