fakir
New dataset ideas
Ideas for new fake datasets:
- [ ] Movies
- [ ] Sport (maybe basketball)
- [ ] Product reviews (à la Amazon)
- [ ] Tweets
- [ ] Pharma related
Hi, great package! It would be even better with a fake basket dataset (to work on recommendations).
Hi! Looks like something I can try... Are you looking for datasets that look like real ones but with fake data (e.g. fake movie titles)?
@KittJonathan yes, the idea is to have functions that generate datasets that look like real ones but are not :)
We use them for teaching and for prototyping shiny applications 😇
I would like to add that, since the datasets are used for teaching, it is better if they have characteristics that force learners to use specific functions for data wrangling, cleaning, plotting...
We are not asking for all of the following characteristics to be addressed, of course, but keep in mind which package or part of R you would like to teach with this specific dataset.
If you have a look at examples in the README, you will see some datasets:
- with missing values
- with dates to clean
- character, numeric classes
- requiring join
- requiring pivot functions
- with non-random distributions of numeric values or counts per class, allowing for meaningful use of `geom_*` in {ggplot2}
- from specific fields that are of interest to some attendees because they resemble their everyday datasets or hobbies
Thank you in advance for your contribution.
Thanks for the tips! As a fan of Dr Who, I was pondering the idea of creating a fake table of episodes, with dates, ratings, characters, ...
Oops sorry, I misunderstood the issue ... I thought it was about creating fake datasets in table format, not creating functions to create datasets, which seems a little too much for me. Sorry about that!
Hi,
I could take up creating a function for creating fake movie datasets. My idea for the output so far is a data.frame looking like this:
- `title` - possibly with the option of sequels (which should be a copy of a row with some added noise in the numeric variables)
- `genres` - string with one or more genres for a single movie separated (e.g. `drama, action, sci-fi`), possibly with some combinations more likely than others
- `rating` - a number between 0 and 10
- `cinema_release_date` - possibly with some non-random missing data to simulate lockdowns?
- `budget` - $
- `revenue` - $, probably should be weakly correlated with budget
- `runtime` - in the format `2h 35min`
- `language`
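To make this more concrete, here is a minimal base R sketch of the columns listed above. Everything in it is an assumption for illustration: `fake_movies()` is a hypothetical name and not an existing {fakir} function, and the arguments, genre pool, lockdown window and distributions are placeholders. Sequels and weighted genre combinations are left out.

```r
# Hypothetical sketch: fake_movies() is NOT part of {fakir}.
# All pools, dates and distributions below are illustrative assumptions.
fake_movies <- function(n = 100, seed = 2811) {
  set.seed(seed)
  genre_pool <- c("drama", "action", "sci-fi", "comedy", "thriller")
  # One to three genres per movie, collapsed into a single string
  genres <- vapply(
    seq_len(n),
    function(i) paste(sample(genre_pool, sample(1:3, 1)), collapse = ", "),
    character(1)
  )
  budget <- round(rlnorm(n, meanlog = 16, sdlog = 1))
  # Revenue is loosely tied to budget through a noisy multiplier
  revenue <- round(budget * rlnorm(n, meanlog = 0, sdlog = 1))
  release <- as.Date("2015-01-01") + sample(0:2920, n, replace = TRUE)
  # Non-random missingness: most dates in an assumed lockdown window are dropped
  in_lockdown <- release >= as.Date("2020-03-01") &
    release <= as.Date("2021-06-30")
  release[in_lockdown & runif(n) < 0.7] <- NA
  minutes <- sample(75:200, n, replace = TRUE)
  data.frame(
    title = paste("Movie", seq_len(n)),
    genres = genres,
    rating = round(runif(n, 0, 10), 1),
    cinema_release_date = release,
    budget = budget,
    revenue = revenue,
    # Deliberately stored as text (e.g. "2h 35min") so it needs cleaning
    runtime = sprintf("%dh %02dmin", minutes %/% 60L, minutes %% 60L),
    language = sample(c("English", "French", "Spanish"), n, replace = TRUE),
    stringsAsFactors = FALSE
  )
}

movies <- fake_movies(n = 50)
str(movies)
```

Keeping `runtime` as a string and putting several genres in one cell is intentional here: it gives learners something to parse and split.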
What do you think?