lintr icon indicating copy to clipboard operation
lintr copied to clipboard

duplicate_argument_linter should skip on mutate() calls

Open MichaelChirico opened this issue 2 years ago • 4 comments

dplyr::mutate() allows sequential updates to variables, e.g.

x %>%
  dplyr::mutate(
    col = gsub("t", "", col),
    col = gsub("\\s+$", "xxx", col)
  )

but duplicate_argument_linter() sees col used twice and throws.

Not throwing the 3.0.0 milestone quite yet...

MichaelChirico avatar Jun 01 '22 23:06 MichaelChirico

mutate and transmute(?) could be added to the default exclude list.

OTOH I don't like this style and, at least on older versions of dbplyr, that backend actually didn't work with this kind of code, since reusing columns in SQL isn't possible. So I'm slightly in favor of linting this by default.

WDYT?

AshesITR avatar Jun 02 '22 05:06 AshesITR

The problem IMO is every mutate() copies the table again which is quite wasteful in general. For dbplyr, it's not as wasteful since those backends are typically lazy.

We have a ConsecutiveMutateLinter() for precisely this, and it attempts a carveout for dbplyr by looking for library(dbplyr) in the script.

Maybe we should have @hadley or @DavisVaughan weigh in here about dplyr best practices.

MichaelChirico avatar Jun 02 '22 05:06 MichaelChirico

I personally would convert the above to use pipe:

x |> 
  dplyr::mutate(
    col = col |> str_replace("t", "") |> str_replace("\\s+$", "xxx")
  )

But I don't thinking mutating a variable multiple times is an anti-pattern. dplyr copies the column involved, so it's not quite as expensive as you might think.

hadley avatar Jun 02 '22 12:06 hadley

OK, looks like an allowlist is the way to go then. not tagging for 3.0.0 release either.

MichaelChirico avatar Jun 02 '22 15:06 MichaelChirico