typed More restrictions for Data.frame()

Hi all, I think it would be great to have the following features for data frames:

The df must contain these column names
The df can only contain these column names
The df must contain all of these column names and no others
Set a data type per column
Whether an empty df is acceptable or not: since we cannot create an empty df with column names, either allow an empty df and check the restrictions above once it is modified, or reject an empty df from the start.
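As a rough illustration, the name checks in the list above can be sketched in base R. `check_names()` and its `mode` and `empty_ok` arguments are made-up names for this sketch and are not part of typed. Note also that base R itself can build a zero-row data frame that still carries column names and types, which the last lines show.

```r
# Hypothetical helper (not part of typed): validate the column names of a data frame.
check_names <- function(df, cols,
                        mode = c("contains", "within", "exact"),
                        empty_ok = TRUE) {
  mode <- match.arg(mode)
  stopifnot(is.data.frame(df), is.character(cols))
  ok <- switch(mode,
    contains = all(cols %in% names(df)),   # df must contain these columns
    within   = all(names(df) %in% cols),   # df can only contain these columns
    exact    = setequal(names(df), cols)   # all of them and no others (order ignored)
  )
  if (!ok) stop("wrong column names (mode = '", mode, "')")
  if (!empty_ok && nrow(df) == 0) stop("empty data frame is not allowed")
  invisible(df)
}

check_names(cars, c("speed", "dist"), mode = "exact")          # passes silently
check_names(cars, c("speed", "dist", "foo"), mode = "within")  # passes silently

# A zero-row data frame with named, typed columns, built with base R only
empty <- data.frame(speed = numeric(0), dist = numeric(0))
check_names(empty, c("speed", "dist"), mode = "exact")         # passes: empty_ok defaults to TRUE
```

Calling `check_names(iris, c("speed", "dist"))` or `check_names(empty, c("speed", "dist"), empty_ok = FALSE)` stops with an error instead.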

Thx!

Aug 17 '23 21:08 latot

This would be in part answered by #26 (restrict the type using a prototype in place of an assertion)

Or it can be done by using arguments in Data.frame()

library(typed)
#> 
#> Attaching package: 'typed'
#> The following object is masked from 'package:utils':
#> 
#>     ?
# The df must contain these column names
Data.frame(... = "wrong column names" ~ all(c("speed", "dist") %in% names(value))) ? x
x <- iris
#> Error: wrong column names
#> `all(c("speed", "dist") %in% names(value))`: FALSE
#> `expected`:                                  TRUE
x <- cars

# The df can only contain these column names
Data.frame(... = "wrong column names" ~ all(names(value) %in% c("speed", "dist", "foo"))) ? x
x <- iris
#> Error: wrong column names
#> `all(names(value) %in% c("speed", "dist", "foo"))`: FALSE
#> `expected`:                                         TRUE
x <- cars

# The df must contain all of these column names and no others
Data.frame(... = "wrong column names" ~ identical(sort(names(value)), c("dist", "speed"))) ? x
x <- iris
#> Error: wrong column names
#> `identical(sort(names(value)), c("dist", "speed"))`: FALSE
#> `expected`:                                          TRUE
x <- cars

# Set a type of data per column
Data.frame(... = "wrong col types" ~ is.numeric(value$speed) && is.numeric(value$dist)) ? x
x <- iris
#> Error: wrong col types
#> `is.numeric(value$speed) && is.numeric(value$dist)`: FALSE
#> `expected`:                                          TRUE
x <- cars

Created on 2023-09-13 with reprex v2.0.2

Sep 13 '23 00:09 moodymudskipper

This also doesn't look bad:

library(typed)
#> 
#> Attaching package: 'typed'
#> The following object is masked from 'package:utils':
#> 
#>     ?
library(vctrs)
Data.frame(vec_ptype = data.frame(speed = numeric(0), dist = numeric(0))) ? x
x <- iris
#> Error: `vec_ptype` mismatch
#> `vec_ptype(value)` is length 5
#> `expected` is length 2`vec_ptype` mismatch
#>     names(vec_ptype(value)) | names(expected)    
#> [1] "Sepal.Length"          - "speed"         [1]
#> [2] "Sepal.Width"           - "dist"          [2]
#> [3] "Petal.Length"          -                    
#> [4] "Petal.Width"           -                    
#> [5] "Species"               -                    `vec_ptype` mismatch
#> `vec_ptype(value)$Sepal.Length` is a double vector ()
#> `expected$Sepal.Length` is absent`vec_ptype` mismatch
#> `vec_ptype(value)$Sepal.Width` is a double vector ()
#> `expected$Sepal.Width` is absent`vec_ptype` mismatch
#> `vec_ptype(value)$Petal.Length` is a double vector ()
#> `expected$Petal.Length` is absent`vec_ptype` mismatch
#> `vec_ptype(value)$Petal.Width` is a double vector ()
#> `expected$Petal.Width` is absent`vec_ptype` mismatch
#> `vec_ptype(value)$Species` is an S3 object of class <factor>, an integer vector
#> `expected$Species` is absent`vec_ptype` mismatch
#> `vec_ptype(value)$speed` is absent
#> `expected$speed` is a double vector ()`vec_ptype` mismatch
#> `vec_ptype(value)$dist` is absent
#> `expected$dist` is a double vector ()
x <- cars

Created on 2023-09-13 with reprex v2.0.2

I could make it a special argument to highlight the feature, to not need library(vctrs), and to have a more readable error. But it's redundant with #26 so not sure if it's worth it.
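For readers who want to see the prototype idea without loading vctrs, here is a minimal base-R sketch. It treats the zero-row slice of a data frame as its prototype and compares column names (in order) and column classes with an expected prototype. `same_prototype()` is a hypothetical name, and comparing classes only is a simplifying assumption that is coarser than what `vec_ptype()` reports.

```r
# Hypothetical helper (not part of typed or vctrs): compare a data frame's
# zero-row "prototype" with an expected zero-row data frame.
same_prototype <- function(value, expected) {
  if (!is.data.frame(value)) return(FALSE)
  proto <- value[0, , drop = FALSE]           # keep columns, drop all rows
  identical(names(proto), names(expected)) && # same names, same order
    identical(lapply(proto, class), lapply(expected, class))
}

expected <- data.frame(speed = numeric(0), dist = numeric(0))

same_prototype(cars, expected)  # TRUE
same_prototype(iris, expected)  # FALSE
```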

Sep 13 '23 00:09 moodymudskipper

Mmm, I think

OK, that is a lot of features I was not able to find; they are great.

Even so, I still think it would be better to have these options as parameters, for two reasons. The first is that in R it is not simple to define a type: one advantage of something like typed::Character() is that it performs all the checks needed to know that the value is at least a character, and over time there are always edge cases.

The character case may be a simple one, but there are other, more complex types for which that format makes it pretty hard. We would need to split the work into two functions, one is.type-style predicate and another for the typed function if needed, which seems redundant.

Remember that comparing things in R is not always easy... so I think that if it is possible to spare the user from "comparing" things, that would be better.
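A few base-R one-liners illustrate why hand-written comparisons can surprise. This is only an illustration of the point above; the variable names are arbitrary.

```r
x_int <- 1:3          # integer vector
x_dbl <- c(1, 2, 3)   # double vector holding the same values

identical(x_int, x_dbl)                  # FALSE: the storage types differ
all(x_int == x_dbl)                      # TRUE: the values compare equal
is.numeric(x_int) && is.numeric(x_dbl)   # TRUE: both count as numeric
is.integer(x_dbl)                        # FALSE

f <- factor(c("a", "b"))
is.character(f)                          # FALSE: a factor is not a character vector
```

A ready-made assertion hides these distinctions from the user, which is the convenience being asked for here.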

The other point is how to connect that with a custom assertion. Over time I have been using custom assertions inside other custom assertions because of some objects. When this happens and we evaluate a function like Data.frame(...., custom_part), we can no longer touch that assertion if we need to, because it is already evaluated.

I think it is better to do something like this in a way that is easier to handle, and I even think it is easier if the logic follows how the lib presents the formats.

I have written a little code for data frames. It does something pretty similar to what you wrote, but I think it helps to read the assertions more easily.

typed_data_frame.zip

OK, I'm a noob at writing things with your lib, and you would probably say there are better ways to achieve that; I just think this can be more intuitive:

typed_Data.frame(
  # set column types
  columns = list(
    id = typed::Integer(anyNA = FALSE, null_ok = FALSE),
    name = typed::Character(anyNA = FALSE, null_ok = FALSE),
    path = typed::List(
      each = typed::Integer(null_ok = FALSE, anyNA = FALSE)
    )
  ),
  # check all the columns, or only the ones that exist
  select = "all_of",
  # all the declared columns must exist, and no others
  only_cols = TRUE,
  # the df must not be empty
  empty_ok = FALSE,
  anyNA = FALSE
) ? df

Personally, I find something like that ideal to use, in particular where we use the typed functions for the column types. It feels so intuitive!
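To make the intended semantics of the proposal concrete, here is a base-R sketch in which plain predicate functions stand in for the typed column assertions. `check_columns()` is a hypothetical function written for this illustration. It is neither the attached code nor part of typed, and the meaning given to `only_cols` and `empty_ok` is an assumption based on the comments in the proposal.

```r
# Hypothetical validator: `columns` is a named list of predicate functions,
# one per declared column.
check_columns <- function(df, columns, only_cols = FALSE, empty_ok = TRUE) {
  stopifnot(is.data.frame(df), is.list(columns), !is.null(names(columns)))
  missing_cols <- setdiff(names(columns), names(df))
  if (length(missing_cols) > 0)
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  extra_cols <- setdiff(names(df), names(columns))
  if (only_cols && length(extra_cols) > 0)
    stop("unexpected columns: ", paste(extra_cols, collapse = ", "))
  if (!empty_ok && nrow(df) == 0)
    stop("empty data frame is not allowed")
  for (nm in names(columns)) {
    if (!isTRUE(columns[[nm]](df[[nm]])))
      stop("column `", nm, "` failed its check")
  }
  invisible(df)
}

# Predicates standing in for Integer(), Character() and List(each = Integer())
spec <- list(
  id   = function(x) is.integer(x) && !anyNA(x),
  name = function(x) is.character(x) && !anyNA(x),
  path = function(x) is.list(x) && all(vapply(x, is.integer, logical(1)))
)

df <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
df$path <- list(1:3, 4:5)  # list column of integer vectors

check_columns(df, spec, only_cols = TRUE, empty_ok = FALSE)  # passes silently
```

Keeping the per-column checks as a named list means the same specification could be reused or extended without rewriting the name and emptiness checks.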

It's just an option :D

About vctrs: I have done a quick, basic read about it. Is it useful? I think yes. Is it redundant to implement this issue in the package? I think no.

OK, vctrs can help to handle types, clarify things, and do things better, so why do I think it is not redundant?

Because if something like this is not implemented and depends on vctrs in that way, it means "ppl must learn vctrs before they can use the typed package in a nice way".

Not all ppl know vctrs (I just learned about it here), and if you add that to the mix, ppl would say "OK, too complex"; the learning curve would be too steep. I think implementing these options would help ppl use the lib in a nice way and lower the learning curve. We can recommend learning vctrs, and if ppl learn it, they can then use both at the same time.

Thx!

Sep 14 '23 20:09 latot

@moodymudskipper Would you agree to changes like the ones I proposed here? I can send a PR and you can review it.

Sep 28 '23 12:09 latot

I'm not sure yet. I don't want to feature bloat, I see the value but I made the package extensible so users could extend the types to their needs and have the package itself be centered around the concept and not specific type features.

I find the proposal above confusing regarding names and conflicting arguments. I see that I have included an "each" argument already and I almost regret it tbh. Maybe with great names and a good, simple API proposal I can change my mind but I'm not convinced yet.

Sep 28 '23 13:09 moodymudskipper

:O I see, so the intention of the lib is not to provide the "base" types for use; it is more focused on being the base for custom types.

In that case, the current types are more like examples. It also means it would be great to have a second lib that exposes base types for use with some more options, at least the common/useful ones.

Sep 28 '23 13:09 latot

Well, I want the package to be usable as is without much customisation, but there are compromises to make. That being said, I think your request is a good candidate for a documented example on how to design a restricted type from a data frame.

Sep 28 '23 14:09 moodymudskipper

Just to know, which would be the main compromises of the lib? It is good to know (and to read it in the README) in order to decide whether to work on a feature here or in a new lib, for example.

I think the features requested here are key for use with data frames; without them we need to move to customisation.

Sep 28 '23 20:09 latot

I'll think about it at the same time as #39 but it's close to last on the priorities.

I encourage you to implement the extension yourself meanwhile. I think you have everything here so that the definition of your custom data frame type should only take a few lines, and it might be improved if I get somewhere with #39.

Sep 28 '23 23:09 moodymudskipper