openrefine-socialsci

Dataset should be messier and larger

Open bencomp opened this issue 2 years ago • 7 comments

The prepared SAFI dataset is, I think, not messy enough to really show OpenRefine's power.

I would like:

  • several cells with leading or trailing whitespace in rows that are far apart (to use with "Trim leading and trailing whitespace" transform)
  • more variants in the village names or another column (to use with clustering)
  • a date that is a clear outlier (to find using a timeline facet)
  • non-numeric data in a column that should be numeric (to find using a numeric facet)

See also #35. The number of columns doesn't make the data messy.


Summarising the to-dos from the discussion below:

  • [ ] Add leading and trailing spaces to (let's say 10) cells in the village and respondent_roof_type columns in rows that are far apart
  • [ ] Add accents, spaces in the middle of names, or typos to several cells in the village column
  • [ ] Change a few date values to a different date format (making sure that parsing works correctly, or fails completely, so that you don't think everything worked when it did not)
  • [ ] Change the year on a date value to make that value an (obvious) outlier, e.g., December 2016 becomes December 2017
  • [ ] Change a numeric value to a non-numeric value like NULL
  • [ ] Change a numeric (missing?) value to -99
  • [ ] Add a step on setting the character encoding to the project creation section
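
A rough sketch of how the data edits above could be scripted rather than done by hand, so that the changes are reproducible. This is only an illustration, not how the lesson data was actually prepared. The helper names, the sample rows, the `interview_date` and `no_membrs` column names, and the ISO date format are all assumptions; only `village` and `respondent_roof_type` come from the list above. The encoding step is a lesson-text change and is not covered here.

```python
# Hypothetical helpers for making a list of row dicts messier (stdlib only).

def pad_cells(rows, columns, n=10):
    """Add a leading or trailing space to n cells in rows that are far apart."""
    step = max(1, len(rows) // n)
    touched = []
    for k, i in enumerate(range(0, len(rows), step)):
        if k >= n:
            break
        col = columns[k % len(columns)]  # alternate between the given columns
        value = rows[i][col]
        # Even picks get a leading space, odd picks a trailing one.
        rows[i][col] = " " + value if k % 2 == 0 else value + " "
        touched.append((i, col))
    return touched


def split_name(rows, column, index):
    """Insert a space in the middle of a name to create a clustering variant."""
    value = rows[index][column]
    mid = len(value) // 2
    rows[index][column] = value[:mid] + " " + value[mid:]


def shift_year(rows, column, index, old="2016", new="2017"):
    """Change the year in a date string so the value becomes an outlier."""
    rows[index][column] = rows[index][column].replace(old, new, 1)


def overwrite(rows, column, index, value):
    """Replace a cell with a placeholder such as NULL or -99."""
    rows[index][column] = value


# Invented sample rows; real data would be read with csv.DictReader.
rows = [
    {
        "village": "Chirodzo",
        "respondent_roof_type": "grass",
        "interview_date": "2016-12-01",  # assumed column name and format
        "no_membrs": "4",                # assumed numeric column
    }
    for _ in range(40)
]

touched = pad_cells(rows, ["village", "respondent_roof_type"], n=10)
split_name(rows, "village", 7)
shift_year(rows, "interview_date", 12)
overwrite(rows, "no_membrs", 20, "NULL")
overwrite(rows, "no_membrs", 33, "-99")
```

Returning the list of touched cells from `pad_cells` makes it easy to note in the instructor guide which rows learners should expect to change when they apply the trim transform.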

While we are making changes, I think this should (or could) be part of the update too, even though it was part of #29 and not explicitly mentioned now:

  • [ ] Add more rows from the original dataset

bencomp · Jun 23 '22 21:06