New dataset ideas

Open ColinFay opened this issue 6 years ago • 7 comments

Idea for new fake datasets:

  • [ ] Movies
  • [ ] Sport (maybe basket)
  • [ ] Product review (a la amazon)
  • [ ] Tweets
  • [ ] Pharma related

ColinFay avatar Sep 02 '19 19:09 ColinFay

Hi, great package! It would be even better with a fake basket dataset (to work on recommendation).

aboulaboul avatar Sep 06 '19 12:09 aboulaboul

Hi! Looks like something I can try... Are you looking for datasets that look like real ones but with fake data (e.g. fake movie titles)?

KittJonathan avatar Oct 01 '21 14:10 KittJonathan

@KittJonathan yes, the idea is to have functions that generate datasets which look like real ones but are not :)

We use them for teaching and for prototyping shiny applications 😇

ColinFay avatar Oct 01 '21 16:10 ColinFay

I would like to add that, since the datasets are used for teaching, it is better if they have characteristics that force learners to use specific functions for data wrangling, cleaning, plotting...

We do not ask for all of the following characteristics to be addressed, of course, but keep in mind which package or part of R you would like to teach with this specific dataset.

If you have a look at examples in the README, you will see some datasets:

  • with missing values
  • with dates to clean
  • character, numeric classes
  • requiring join
  • requiring pivot functions
  • with non random distribution of numeric values or number per class allowing for meaningful use of geom_* in {ggplot2}
  • from specific fields that interest some attendees because they resemble their everyday datasets or their hobbies
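
As a rough illustration of a few of these characteristics (missing values, dates to clean, unbalanced classes, numeric values that depend on the class), here is a minimal base R sketch. The function name `fake_teaching_data()`, its arguments and its column names are hypothetical and are not part of {fakir}. It does not cover the join or pivot cases.

```r
# Hypothetical helper, not part of {fakir}: a sketch of a generator that
# produces a small dataset with some "teachable" defects, using base R only.
fake_teaching_data <- function(n = 100, seed = 42) {
  set.seed(seed)

  dates <- as.Date("2021-01-01") + sample(0:364, n, replace = TRUE)
  # Mix two date formats so that the column needs cleaning
  messy_dates <- ifelse(
    runif(n) < 0.5,
    format(dates, "%Y-%m-%d"),
    format(dates, "%d/%m/%Y")
  )

  # Unbalanced classes, so that counts per class are worth plotting
  region <- sample(
    c("North", "South", "East"),
    n, replace = TRUE, prob = c(0.6, 0.3, 0.1)
  )

  # Numeric values whose mean depends on the class (non-random structure)
  region_mean <- c(North = 50, South = 70, East = 90)
  score <- rnorm(n, mean = region_mean[region], sd = 10)

  # Set about 10% of the scores to NA
  score[sample(n, size = round(n * 0.1))] <- NA

  data.frame(
    id = seq_len(n),
    visit_date = messy_dates,
    region = region,
    score = score,
    stringsAsFactors = FALSE
  )
}

visits <- fake_teaching_data(n = 200)
str(visits)
```

A dataset like this gives learners a reason to parse dates, handle `NA`, and compare groups in a plot.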

Thank you in advance for your contribution.

statnmap avatar Oct 03 '21 08:10 statnmap

Thanks for the tips ! As a fan of Dr Who, I was pondering the idea of creating a fake table of episodes, with dates, ratings, characters, ...

KittJonathan avatar Oct 03 '21 14:10 KittJonathan

Oops sorry, I misunderstood the issue ... I thought it was about creating fake datasets in table format, not creating functions to create datasets, which seems a little too much for me. Sorry about that!

KittJonathan avatar Oct 04 '21 06:10 KittJonathan

Hi,

I could take up creating a function for creating fake movie datasets. My idea for the output so far is a data.frame looking like this:

  • title - possibly with the option of sequels (which should be a copy of a row with some added noise in the numeric variables)
  • genres - string with one or more genres for a single movie, separated by a delimiter (e.g. drama, action, sci-fi), possibly with some combinations more likely than others
  • rating - a number between 0-10
  • cinema_release_date - possibly with some non-random missing data to simulate lockdowns?
  • budget - $
  • revenue - $, probably should be weakly correlated with budget
  • runtime - in the format 2h 35min
  • language
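
To make the proposal concrete, here is a minimal base R sketch of such a function. Everything in it is an assumption for illustration: the name `fake_movies()`, the word lists, the lockdown window, and the distributions are made up, and this is not an existing {fakir} function. Sequels and weighted genre combinations are left out.

```r
# Hypothetical function, not part of {fakir}; all names and values are made up.
fake_movies <- function(n = 50, seed = 2021) {
  set.seed(seed)
  adjectives <- c("Silent", "Broken", "Last", "Hidden", "Crimson")
  nouns <- c("Harbor", "Signal", "Garden", "Empire", "Winter")
  genre_pool <- c("drama", "action", "sci-fi", "comedy", "thriller")

  title <- paste(
    "The",
    sample(adjectives, n, replace = TRUE),
    sample(nouns, n, replace = TRUE)
  )

  # One to three genres per movie, collapsed into a single string
  genres <- vapply(seq_len(n), function(i) {
    paste(sample(genre_pool, sample(1:3, 1)), collapse = ", ")
  }, character(1))

  rating <- round(runif(n, min = 0, max = 10), 1)

  release <- as.Date("2018-01-01") + sample(0:1460, n, replace = TRUE)
  # Non-random missingness: dates inside an assumed lockdown window become NA
  lockdown <- release >= as.Date("2020-03-15") & release <= as.Date("2020-06-30")
  release[lockdown] <- NA

  budget <- round(rlnorm(n, meanlog = log(2e7), sdlog = 1))
  # Revenue is budget times a noisy multiplier, so the link to budget is loose
  revenue <- round(budget * rlnorm(n, meanlog = 0, sdlog = 1.5))

  # Runtime stored as text ("2h 35min"), so it has to be parsed before use
  minutes <- sample(80:190, n, replace = TRUE)
  runtime <- sprintf("%dh %dmin", minutes %/% 60L, minutes %% 60L)

  language <- sample(
    c("English", "French", "Czech", "Japanese"),
    n, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.1)
  )

  data.frame(
    title, genres, rating,
    cinema_release_date = release,
    budget, revenue, runtime, language,
    stringsAsFactors = FALSE
  )
}

movies <- fake_movies(n = 100)
head(movies)
```

Keeping `genres` and `runtime` as strings is deliberate in this sketch: it gives learners something to split and parse.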

What do you think?

skvrnami avatar Oct 10 '21 07:10 skvrnami