lintr icon indicating copy to clipboard operation
lintr copied to clipboard

Dplyr, specify what is column, and what is variable

Open latot opened this issue 2 years ago • 10 comments

Hi hi, I think add to the linter as a good practice, specify in dplyr filters and similar what is column and what is var, like this example:

data <- as.data.frame(list(b=seq(10)))
b <- 5
#In this case "b" is both (left and right) the variable, so is TRUE
dplyr::filter(a, b==b)
    b
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10 10

Is very hard to read code when we don't know what is column, or what is a variable, because we can compare two columns too...

Well, after checking and checking, how can we tell dplyr what is column and what is var, here the results:

data <- as.data.frame(list(b=seq(10)))
b <- 5
col <- "b"
#!!b will be recognized as a var, !!5 too!, b will be recognized as a column
dplyr::filter(a, b==!!b)

#!!dplyr::sym(col) the content of col will be recognized as a column, while !!b will be recognized as a variable
dplyr::filter(a, !!dplyr::sym(col)==!!b)

The criteria would be the next one, if we are using a variable inside a filter or similar, it must always have the !! or in case we want to use the content as a column the !!dplyr::sym(var)

Thx!

latot avatar May 30 '22 20:05 latot

To me personally, I would rather someone just use base syntax than make their code so complicated:

a[a[[col]] == b], ]

conveys this pretty unambiguously since [.data.frame doesn't allow mixing scopes like dplyr::filter() does.

that said, maybe others are more keen to keep things tidyverse-only in their code. if there is more general interest in such a linter, we'd be happy to review a PR.

MichaelChirico avatar May 31 '22 03:05 MichaelChirico

The canonical way of achieving this in tidyverse is usinge the .env and .data verbs.

a %>% filter(.data$b == .env$b)

AshesITR avatar May 31 '22 05:05 AshesITR

@MichaelChirico Hi, yes, that is very clear, sadly some times, like here, we work with ppl that is not always fully technical, so is pretty important the code to be easy to understand.

I think the solution of @AshesITR is great, we can access to anything in a clear way, I think would be great have this in the linter.

latot avatar May 31 '22 14:05 latot

Not sure we can do much about that. It's not really feasible to statically detect name clashes between outer scope and columns of a data frame unless the latter is trivially generated within the code.

A PR would be welcome though, if you manage to do that.

AshesITR avatar May 31 '22 16:05 AshesITR

IIUC the desired linter blocks all plain symbol usage, i.e., everything must be .data or .env, which IINM could be done statically

MichaelChirico avatar May 31 '22 16:05 MichaelChirico

But surely you don't want to lint e.g. this:

df %>% mutate(
  my_col = purrr::map_dbl(.data$my_other_col, my_fun),
  my_new_col = mean(.data$my_col_too, na.rm = T)
)

Also not sure if that wouldn't make even simple data transformation code borderline unreadable due to the extreme verbosity.

AshesITR avatar May 31 '22 17:05 AshesITR

Sadly, yes, I think that example should be linted, pick this example:

B <- 10
df %>% mutate(
  my_col = purrr::map_dbl(A*B, my_fun),
  my_new_col = mean(B, na.rm = T)
)

I know extreme verbosity is not ideal, like the first examples I wrote to do this works, are ugly and hard to understand, but I think the first priority is clarify, be easy to understand and do it works.

B <- 10
df %>% mutate(
  my_col = purrr::map_dbl(.data$A*.env$B, my_fun),
  my_new_col = mean(.data$B, na.rm = T)
)

This is more verbose, but at least, we would be able to know what the code is trying to do.

latot avatar May 31 '22 18:05 latot

@latot but I'm not using any B in my example.

The actual things that would cause lints, because they are symbols not preceded by .data or .env, are my_fun, T, and possibly mean.

So what you'd have to write if you lint all symbols would be

df %>% mutate(
  my_col = purrr::map_dbl(.data$my_other_col, .env$my_fun),
  my_new_col = base::mean(.data$my_col_too, na.rm = .env$T)
)

Way too verbose for a trivial task in my opinion.

AshesITR avatar May 31 '22 21:05 AshesITR

if we lint T->TRUE and only block on SYMBOL (not SUMBOL_FUNCTION_CALL), we're down to only needing .env$my_fun.

possibly that could be eliminated through a carve out for base/purrr mappers. inverse logic applies to the future character_only_linter (current name) where we look inside mappers for library/require calls...

I'd still like to see broader support for this idea before accepting into linter.

MichaelChirico avatar May 31 '22 21:05 MichaelChirico

Hi, okis, poit by poin.

@AshesITR , your example don't have any B I added it just to show, there is similar cases where we would need linting there, as all the things, there is some cases where we can understand fine the code, and others not, the idea is show both and just simplify this.

With your second example, yes, that is...., a lot of verbose, but I think is not trivial, I'm checking a lot of code, where this is a real and a big problem, when we do this filters and transformations, we don't know what is what, is hard to understand when there is several variables/columns involved and helps to produce bugs.

@MichaelChirico Right, there is a lot that need to be discussed and checked before.

A way to do this..., is work by context, ex:

df %>% mutate(
  my_col = purrr::map_dbl(A*B, my_fun),
  my_new_col = mean(B, na.rm = TRUE)
)

In this case, in the context of the mutate there is no var called A or B, that means the only interpretation to it is as columns, while:

A <- 10
df %>% mutate(
  my_col = purrr::map_dbl(A*B, my_fun),
  my_new_col = mean(B, na.rm = TRUE)
)

In this case, if we see in the mutate we can't know if A is var or column, and in this case the linter can request to specify it. This would helps to reduce the verbose to the minimum.

Thx.

latot avatar May 31 '22 22:05 latot

Looking again, I encourage this as a custom linter in downstream. Unless {tidyverse} authors themselves tout this approach, I think it's out of scope.

MichaelChirico avatar Sep 20 '23 16:09 MichaelChirico