Data validation

Open higgi13425 opened this issue 4 years ago • 5 comments

It would be great if you could

  1. Fix the data type for each variable before you begin entering data, or repair it later if it was guessed wrong.
  2. Set valid ranges for each variable, e.g. systolic blood pressure in healthy adults, 70-160. Out-of-range values would be challenged, with a suggestion to change the range if the value is actually correct.
  3. Set up allowed values for factor variables, with a drop-down that limits entries to only these.

Standard paid data entry expects errors in 1-2% of fields: for every 100 fields, 1-2 errors. It adds up. It is a big deal. Lots of these could be prevented with fixed data types and value ranges. For example, it is common to get race entered as white, White, Caucasian, Black, black, African-American, African-american, aa, etc.

higgi13425 avatar Dec 22 '20 02:12 higgi13425

@higgi13425, thanks for the feedback!

  1. rhandsontable does support column types, but unfortunately using them removes the ability to add/remove rows/columns. To get around this, DataEditR does not assign column types at the rhandsontable level but instead sorts them out afterwards using utils::type.convert(). This means that the class of a column depends on the data that is entered, i.e. if a non-numeric character is entered anywhere in the column, the column will be converted to class character.

  2. As I mentioned, unfortunately rhandsontable does not and will never support slider inputs for cells. Newer versions of Handsontable come with licence restrictions, so rhandsontable uses a fixed, older version of Handsontable. Implementing this at the level of DataEditR may be possible through the col_options argument, where these limits could be supplied. This does, however, mean that the data will need to be checked with every edit, which may be very inefficient. I can certainly play around with it and see whether it is worth implementing or not.

  3. This feature is already implemented! If you have specific factor levels for a column, just pass them to col_options for that column and a dropdown menu will appear. See below:

data_edit(iris,
          col_options = list(Species = c("setosa", "virginica", "versicolor")))

Now that I think about it, it may be worthwhile supporting this as well (we could grab the factor levels from the data directly):

data_edit(iris,
          col_options = list(Species = "dropdown"))
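
To illustrate point 1, here is a minimal sketch of how utils::type.convert() decides the class from the entered values alone. The vectors below are made-up entries, not DataEditR internals; they only show why a single mistyped cell turns a numeric column into character.

```r
# Made-up cell contents, as they might arrive from the table as text
clean_entries <- c("120", "135", "118")
typo_entries  <- c("120", "13S", "118")  # "13S" is a typing slip for "135"

# as.is = TRUE keeps text as character instead of converting it to factor
clean_converted <- utils::type.convert(clean_entries, as.is = TRUE)
typo_converted  <- utils::type.convert(typo_entries, as.is = TRUE)

class(clean_converted)  # every value parses as a whole number
class(typo_converted)   # one unparseable value makes the whole column character
```

So a column that unexpectedly comes back as character is itself a hint that at least one entry is malformed.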

DillonHammill avatar Jan 09 '21 05:01 DillonHammill

The dropdowns, checkboxes and date selection are great for preventing data entry errors, but it would be awesome to somehow add limits to fields (min and max for weight/height/birthdate/systolicBP) so that an out-of-range entry (e.g. a blood pressure of 1400) will be rejected. Error rates for data entry run 3-6% per cell. Proactive error prevention is really important for valuable data.

higgi13425 avatar May 25 '21 01:05 higgi13425

@higgi13425, I like the idea of validating column entries.

I guess I could do something like this for numeric columns:

data_edit(mtcars,
          col_validate = list(vs = c(0, 1))) # make sure vs values are between 0 and 1

Similarly for character columns:

data_edit(iris,
          col_validate = list(Species = c("setosa",
                                          "versicolor",
                                          "virginica"))) # must match exactly

The main question is what do you expect to happen when the entered data does not match these requirements? Do we make the cell empty again?

Also I suspect that this would only be supported for columns that don't use checkboxes or dropdown menus.

Note to self: need to add NA as accepted entry for empty cells.
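
One possible answer to the question above, sketched in base R: clear any entry that fails its rule and always accept NA. The helper validate_columns() and its rules argument are hypothetical names used for illustration only; they are not part of DataEditR.

```r
# Hypothetical helper, not part of DataEditR.
# `rules` is a named list: a numeric vector gives an inclusive range,
# a character vector gives the set of allowed values (exact match).
validate_columns <- function(data, rules) {
  for (col in names(rules)) {
    rule <- rules[[col]]
    values <- data[[col]]
    if (is.numeric(rule)) {
      ok <- values >= min(rule) & values <= max(rule)
    } else {
      ok <- as.character(values) %in% rule
    }
    # NA (an empty cell) is always accepted; other failing entries are cleared
    bad <- !is.na(values) & !ok
    data[[col]][bad] <- NA
  }
  data
}

# Made-up entries: one impossible blood pressure, one inconsistent label
entries <- data.frame(
  sbp = c(120, 1400, NA),
  race = c("White", "white", "Black"),
  stringsAsFactors = FALSE
)
checked <- validate_columns(
  entries,
  rules = list(sbp = c(70, 160), race = c("White", "Black"))
)
```

Clearing the cell is only one choice; the same `bad` index could instead be used to flag the entry and ask the user to confirm it.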

DillonHammill avatar May 25 '21 01:05 DillonHammill

Looking at my previous comments, I think it would be a better idea to extend this functionality to col_options instead. The reason for this is that if levels are set for a character column then we should use dropdowns (the user can still type and the best match is displayed), but for numeric columns we just check whether the data is within range and remove it if not.

data_edit(iris,
          col_options = list(Species = c("setosa", "versicolor", "virginica"),   # dropdown
                             Sepal.Length = c(0, 10)))                           # range

The challenge will be in checking the edited data, particularly since the entire dataset is returned with each edit. I will need to look at the internals of rhandsontable to see if I can get information about specific edits and check those against the supplied range.
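
If per-edit information turns out not to be available, one fallback is to compare the table before and after each edit and validate only the cells that changed. The sketch below is a base R illustration of that idea under the assumption that both versions have the same dimensions and column order; changed_cells() is a hypothetical helper, not something rhandsontable or DataEditR provides.

```r
# Hypothetical helper: return the row/column positions of cells that differ.
# Assumes `before` and `after` have identical dimensions and column order.
changed_cells <- function(before, after) {
  # Compare as text so that columns of different classes can be handled alike
  old <- as.matrix(data.frame(lapply(before, as.character), stringsAsFactors = FALSE))
  new <- as.matrix(data.frame(lapply(after, as.character), stringsAsFactors = FALSE))
  differs <- old != new
  differs[is.na(differs)] <- FALSE                  # comparisons involving NA
  differs <- differs | (is.na(old) != is.na(new))   # a cell was filled or emptied
  which(differs, arr.ind = TRUE)
}

# Made-up data: a single cell is edited
before <- data.frame(sbp = c(120, 135), race = c("White", "Black"),
                     stringsAsFactors = FALSE)
after <- before
after$sbp[2] <- 1400

edits <- changed_cells(before, after)
edits  # one row per changed cell, with columns "row" and "col"
```

Only the positions in `edits` would then need to be checked against the supplied range, although the comparison itself still touches every cell on each edit.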

I will get to this eventually, but it is unlikely that I will have time to do this in the next couple of months.

DillonHammill avatar May 25 '21 02:05 DillonHammill

Adding a note to take a look at the pointblank package when I get time to address this request.

DillonHammill avatar May 27 '21 23:05 DillonHammill