Enhancement: Generate standard data sets

ronanstokes-db opened this issue · 3 comments


It would be useful to be able to generate standard data sets, without having to define columns etc., for quick demos and for benchmarking different activities.

The goal would be to make it very easy to quickly generate a data set for benchmarking and other purposes without having to invest much time in learning the details of the data generation framework.

These could be modelled on standard public data sets, such as those published as part of Kaggle challenges: for example, standard data sets for customers, purchases, sales, etc.

In particular, for exploring CDC (change data capture) scenarios, it would be useful to be able to generate complementary standard data sets for both baseline data and incremental data (a rough sketch of this follows the example below).

Proposed Behavior

import dbldatagen as dg

# define a standard data set for customers
testdata_generator = (dg.DataGenerator(spark, name="test_dataset", rows=100000, partitions=20)
                       .usingStandardDataset("customers")
                       )

df = testdata_generator.build()  # build our dataset
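
For reference, the complementary baseline/incremental CDC scenario mentioned above can be approximated with the existing API today. The sketch below is illustrative only: the column names, option choices, and the "churned" status value are assumptions for the example, not part of the proposal.

import dbldatagen as dg
from pyspark.sql import functions as F

# baseline customer data; schema and options here are purely illustrative
base_spec = (dg.DataGenerator(spark, name="customers_base", rows=100000, partitions=20)
             .withIdOutput()
             .withColumn("customer_id", "long", uniqueValues=100000)
             .withColumn("name", "string", template=r'\\w \\w')
             .withColumn("status", "string", values=["active", "dormant"], random=True)
             )

base_df = base_spec.build()  # baseline data set

# incremental changeset: sample a slice of the baseline and modify it, so the
# change rows reference customer ids that already exist in the baseline
incr_df = (base_df.sample(fraction=0.05, seed=42)
           .withColumn("status", F.lit("churned")))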

ronanstokes-db commented Sep 23 '21

@ronanstokes-db Hi Ronan,

Regarding standard datasets for benchmarking and PoC purposes, I have recently built a new package for PySpark which allows easy, instant access to 750+ standard datasets, with search capabilities. Check it out here:

Let me know if it satisfies the requirements you are looking for; then we can integrate my library into dbldatagen 😃

Spratiher9 commented Nov 01 '21

Hi Souvik

Are these all sourced from open source projects with open source licenses?

ronanstokes-db commented Oct 04 '22

Hi @ronanstokes-db

These datasets were distributed as free to use alongside the statistical software environment R and some of its add-on packages. I am porting the same datasets so they can be used from Spark.

However, for each dataset I have also added the original owner and reference details:

For example,

Input

from sparkdataset import data

# Displaying documentation of a dataset
data('titanic', show_doc=True)

Output

titanic

SparkDataset Documentation (adapted from R Documentation. The displayed examples are in R)

## titanic

### Description

The data is an observation-based version of the 1912 Titanic passenger survival log.

### Usage

    data(titanic)

### Format

A data frame with 1316 observations on the following 4 variables.

`class`: a factor with levels `1st class`, `2nd class`, `3rd class`, `crew`
`age`: a factor with levels `child`, `adults`
`sex`: a factor with levels `women`, `man`
`survived`: a factor with levels `no`, `yes`

### Details

titanic is saved as a data frame. It is used to assess risk ratios.

### Source

Found in many other texts

### References

Hilbe, Joseph M (2014), Modeling Count Data, Cambridge University Press.
Hilbe, Joseph M (2007, 2011), Negative Binomial Regression, Cambridge University Press.
Hilbe, Joseph M (2009), Logistic Regression Models, Chapman & Hall/CRC.

### Examples

    data(titanic)
    titanic$survival <- titanic$survived == "yes"
    glmlr <- glm(survival ~ age + sex + factor(class), family=binomial, data=titanic)
    summary(glmlr)
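
For comparison, here is a rough PySpark equivalent of the R example above. It assumes that calling `data('titanic')` without `show_doc` returns a Spark DataFrame; only the documentation call is shown in this thread, so that usage is an assumption.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.sql import functions as F

titanic_df = data('titanic')  # assumed to return a Spark DataFrame

# survival ~ age + sex + class, mirroring the R glm above
titanic_df = titanic_df.withColumn("label", (F.col("survived") == "yes").cast("double"))

cats = ["age", "sex", "class"]
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in cats]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in cats],
                        outputCols=[c + "_vec" for c in cats])
assembler = VectorAssembler(inputCols=[c + "_vec" for c in cats], outputCol="features")

model = Pipeline(stages=indexers + [encoder, assembler, LogisticRegression()]).fit(titanic_df)
print(model.stages[-1].coefficients)  # fitted coefficients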

souvik-databricks commented Oct 06 '22

Coming back to this: if there is already a published package, it does not seem to make sense to simply include it as part of the data generator unless we add some value on top of it. Happy to discuss offline, though.

ronanstokes-db commented Nov 18 '22