datawizard icon indicating copy to clipboard operation
datawizard copied to clipboard

Consolidate `coerce_to_numeric()` into `to_numeric()` via a new argument

Open IndrajeetPatil opened this issue 3 years ago • 14 comments

IndrajeetPatil avatar Jul 22 '22 23:07 IndrajeetPatil

I think the currently existing argument preserve_levels is actually most appropriate to trigger coerce_to_numeric(), but then we have to find a new name for preserve_levels.

strengejacke avatar Jul 31 '22 15:07 strengejacke

@bwiernik @vincentarelbundock what do you think? Do you have suggestions for argument names?

Currently, the preserve_labels argument only works for factors with numeric levels. If preserve_levels = FALSE, the returned numeric vector starts with 1. If preserve_levels = TRUE, the returned numeric vector contains the factor levels, coerced to numeric.

If a vector has character or string levels, preserve_levels has no effects and always a numeric vector, starting at 1, is returned. However, sometimes you want to convert only factors with numeric levels into numeric vectors, and factors with character levels should be left unchanged. This is what coerce_to_numeric() does. However, it would be nice to integrate that function into to_numeric(), and just have a new argument.

I think the currently existing argument preserve_levels is actually most appropriate to trigger coerce_to_numeric(), but then we have to find a new name for preserve_levels.

Suggestions?

library(datawizard)
f1 <- factor(c(1, 2, 3, 4))
f2 <- factor(c(5, 6, 7 , 8))

to_numeric(f1, dummy_factors = FALSE)
#> [1] 1 2 3 4
to_numeric(f2, dummy_factors = FALSE)
#> [1] 1 2 3 4
to_numeric(f2, dummy_factors = FALSE, preserve_levels = TRUE)
#> [1] 5 6 7 8


f1 <- factor(c("a", "b", "c", "d"))
to_numeric(f1, dummy_factors = FALSE)
#> [1] 1 2 3 4
to_numeric(f1, dummy_factors = FALSE, preserve_levels = TRUE)
#> [1] 1 2 3 4

# which argument for `to_numeric()` could behave like `coerce_to_numeric()`?
coerce_to_numeric(f1)
#> [1] a b c d
#> Levels: a b c d

Created on 2022-08-05 by the reprex package (v2.0.1)

strengejacke avatar Aug 05 '22 12:08 strengejacke

Sorry, I'm probably not the right person to ask. I'm a grumpy old man who likes base R as.numeric

😣

vincentarelbundock avatar Aug 05 '22 12:08 vincentarelbundock

I'm not asking you about the functionality ;-) I'd like to know from a native speaker which argument name sounds best for this task :-) And you did your Bsc in 2007, you can't be older than me. 😝

strengejacke avatar Aug 05 '22 13:08 strengejacke

Honestly, I think I just want to challenge the premise of the question: I think a function called to_numeric() should always return a numeric, and it feels like not great design for the output class to depend on the input class and/or arguments. If users want a character, they should call to_character()

vincentarelbundock avatar Aug 05 '22 13:08 vincentarelbundock

The reason for this discussion is that we had several internal .to_numeric() functions that should be replaced by the datawizard function, to avoid spread of code and increase maintainability. However, the internal functions require different handlings. In some cases, to_numeric(factor(c("a", "b", "c", "d"))) should return a numeric, in other cases the original factor, if levels cannot be coerced to numeric.

So we need an additional argument, unless you want to maintain all our internal functions. :-)

strengejacke avatar Aug 05 '22 13:08 strengejacke

I would argue that in those cases clarity and maintainability is best achieved by a standard as.numeric(as.character(x)) or somesuch.

Being more verbose and sticking with base R is often preferable, IMHO.

But obviously, I haven't seen all the use cases, so you know better.

And to be clear, this isn't a big deal to me; I just have difficulty to not give opinions.

vincentarelbundock avatar Aug 05 '22 13:08 vincentarelbundock

I generally agree with Vincent that to_numeric() should consistently

If an argument is used to trigger this conditional behavior, it should be very explicit that this is what the argument does, and it should be independent of the preserve_levels behavior. Something like .only_numberlike (defaulting to false).

I would prefer a separate function like maybe_to_numeric(), but would be okay with an argument like .only_numberlike.

bwiernik avatar Aug 05 '22 13:08 bwiernik

maybe_to_numeric() is the current coerce_to_numeric() (if I understood correctly). So you would opt to not consolidate the two present functions, like suggested in this issue?

strengejacke avatar Aug 05 '22 14:08 strengejacke

I really don't understand the context where to_numeric(factor(c("a", "b", "c", "d"))) should be expected to return a, b, c, d? Why are we trying to convert to numeric and not expecting to always have numeric?

bwiernik avatar Aug 05 '22 22:08 bwiernik

I don't know, but it's used in modelbased for example: https://github.com/search?p=2&q=org%3Aeasystats+to_numeric&type=Code

strengejacke avatar Aug 06 '22 05:08 strengejacke

So coerce_to_numeric() is used only in one place in {easystats} ecosystem, which is in {datawizard} itself:

https://github.com/easystats/datawizard/blob/34cf6bf73cb974a5ab1c17318a4f6b34d3744752/R/data_reshape.R#L151

If we change this to something like

data[["_Row"]] <- as.numeric(row.names(data)) 

we can remove this function. WDYT?

IndrajeetPatil avatar Aug 20 '22 16:08 IndrajeetPatil

Its usage in {modelbased} is covered in a separate issue: https://github.com/easystats/modelbased/issues/206

IndrajeetPatil avatar Aug 20 '22 16:08 IndrajeetPatil

we can remove this function. WDYT?

No, make to internal instead. Else reshape_longer() no longer works.

strengejacke avatar Aug 21 '22 07:08 strengejacke