openrefine-socialsci
Dataset should be messier and larger
The prepared SAFI dataset is, I think, not messy enough to really show OpenRefine's power.
I would like:
- several cells with leading or trailing whitespace in rows that are far apart (to use with "Trim leading and trailing whitespace" transform)
- more variants in the village names or another column (to use with clustering)
- a date that is a clear outlier (to find using a timeline facet)
- non-numeric data in a column that should be numeric (to find using a numeric facet)
See also #35. The number of columns doesn't make the data messy.
Summarising the to-dos from the discussion below:
- [ ] Add leading and trailing spaces to (let's say 10) cells in the `village` and `respondent_roof_type` columns in rows that are far apart
- [ ] Add accents, spaces in the middle of names, or typos to several cells in the `village` column
- [ ] Change a few date values to a different date format (making sure that parsing works correctly, or fails completely, so that you don't think everything worked when it did not)
- [ ] Change the year on a date value to make that value an (obvious) outlier, e.g., December 2016 becomes December 2017
- [ ] Change a numeric value to a non-numeric value like NULL
- [ ] Change a numeric (missing?) value to -99
- [ ] Add a step on setting the character encoding to the project creation section
While we are making changes, I think this should (or could) be part of the update too, even though it was part of #29 and not explicitly mentioned now:
- [ ] Add more rows from the original dataset
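
A few of the to-dos above could be scripted rather than edited by hand, so the messy cells are reproducible. The following is only a rough sketch under stated assumptions: the helper names (`pad_with_spaces`, `shift_year`, `overwrite`) are made up for illustration, the `interview_date` and `household_size` column names are placeholders that may not match the real dataset, and dates are assumed to start with a four-digit year.

```python
import random


def pad_with_spaces(rows, columns, n_cells=10, seed=0):
    """Return a copy of rows with a space added before and after n_cells values.

    Target rows are spaced evenly so the padded cells end up far apart.
    """
    rng = random.Random(seed)
    messy = [dict(row) for row in rows]
    step = max(1, len(messy) // n_cells)
    for i in range(0, len(messy), step)[:n_cells]:
        col = rng.choice(columns)  # pad one of the candidate columns per row
        messy[i][col] = " " + messy[i][col] + " "
    return messy


def shift_year(rows, column, index, years=1):
    """Return a copy of rows where one date's year is moved to create an outlier.

    Assumption: the value starts with a four-digit year, e.g. "2016-12-01".
    """
    messy = [dict(row) for row in rows]
    value = messy[index][column]
    messy[index][column] = str(int(value[:4]) + years) + value[4:]
    return messy


def overwrite(rows, column, index, value):
    """Return a copy of rows with a single cell replaced (e.g. by NULL or -99)."""
    messy = [dict(row) for row in rows]
    messy[index][column] = value
    return messy


# Toy rows for illustration only; these are not taken from the SAFI data.
# "interview_date" and "household_size" are hypothetical column names.
rows = [
    {"village": "Alpha", "respondent_roof_type": "grass",
     "interview_date": "2016-12-01", "household_size": "4"}
    for _ in range(100)
]

messy = pad_with_spaces(rows, ["village", "respondent_roof_type"])
messy = shift_year(messy, "interview_date", index=42)
messy = overwrite(messy, "household_size", index=7, value="NULL")
messy = overwrite(messy, "household_size", index=93, value="-99")
```

Each helper returns a new list and leaves the input untouched, so the clean and messy versions can be compared afterwards. Typos, accents, and alternative date formats are probably easier to add by hand, since they need to look plausible for clustering to be a convincing demonstration.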