funspotr icon indicating copy to clipboard operation
funspotr copied to clipboard

Find functions passed as arguments

Open rjake opened this issue 2 years ago • 5 comments

This is the use case I mentioned at the RStudio conference. I'd like to find functions called as arguments in apply functions or in something like switch()

file_lines <- "
  sapply(mtcars, min)

  agg <- function(x, fun) {
    fn <- 
      switch(
        fun, 
        avg = mean,
        total = sum
      )
    
    fn(x, na.rm = TRUE)
  }

agg(1:3, 'total')
"

file_output <- tempfile(fileext = ".R")
writeLines(file_lines, file_output)
# A tibble: 4 x 2
  funs   pkgs     
  <chr>  <chr>    
1 sapply base     
2 switch base     
3 fn     (unknown)
4 agg    (unknown)

I am expecting to see min, mean and sum in the mix

rjake avatar Jul 28 '22 21:07 rjake

Ahh, yes, I understand now better. Correct, currently it will not identify any functions passed in through apply(), map(), or anything that's passed-in as an argument.

I'm not sure the way to have it capture this, but could look down a few potential paths...

  • Review the output after parsing and see if there is something there that would be easier to filter to these and have an option to keep them...
  • Alternatively have an argument where you can pass in functions or sets of functions and keep the arguments inputted, e.g. c("map*()", "*apply()") and then have some custom parsers to extract the .forFUN` values inputted to these

Feel free to open a PR or give a suggestion if you have an idea on how to handle. *

brshallo avatar Aug 01 '22 22:08 brshallo

I did some exploring and the functions get flagged as "symbol" when called this way. I tried a few approaches but nothing was robust and I wanted to jot down what I learned.

Altogether this was my best attempt
library(tidyverse)

file_lines <- "
  sapply(mtcars, min)
  min(1:10) -> a
  b = base::min(1:10)
  d <- plot <- print <- NULL
  agg <- function(x, fun) {
    fn <- 
      switch(
        fun, 
        avg = mean,
        total = sum
      )
    
    fn(x, na.rm = TRUE)
  }

  agg(1:3, 'total')
"

exprs <- parse(text = file_lines)

parsed_df <- # standardize call, must be a better way to do this
	exprs |> 
	as.character() |> 
	parse(text = _) |> 
	utils::getParseData(includeText = TRUE) |> 
	as_tibble() |> 
	print()
# github added all the 4-space tabs, I'm a 2-space guy 🚀 

parsed_df |> 
	mutate(expr = cumsum(parent == 0)) |> 
	# not sure this fully drops formals
	mutate(
		is_formal = token %in% c("SYMBOL_FORMALS"),
		is_new = lead(token, 2) %in% c("EQ_ASSIGN", "LEFT_ASSIGN")
	) |> 
	filter(token %in% c("SYMBOL_FUNCTION_CALL", "SYMBOL")) |> 
	# not sure this is robust to drop new items used later
	group_by(text) |> 
	filter(cumsum(is_new) == 0) |> 
	ungroup() |> 
	# shorten
	distinct(text) |> 
	# figure out what things are
	mutate(
		from = map_chr(
			text, 
			~find(.x) |> pluck(1) |> toString()
		),
		is = map_chr(
			text, 
			~get0(.x) |> class() |> toString()
		)
	) |> 
	filter(is == "function" & from != ".GlobalEnv")

#   text   from         is      
#   <chr>  <chr>        <chr>   
# 1 sapply package:base function
# 2 min    package:base function
# 3 switch package:base function
# 4 mean   package:base function
# 5 sum    package:base function
All the caveats & info I found

We could find all symbol objects in addition to symbol_function_call objects then inspect them. This casts a very wide net and I think requires the libraries to be attached to know where everything is coming from. https://github.com/brshallo/funspotr/blob/e529b00d4fb6a243e236b86b4283394e23f4feb8/R/spot-funs.R#L11-L17

purrr::map_chr(
  c("min", "mtcars"), 
  ~get0(.x) |> class()
)
# [1] "function"   "data.frame"

Another problem with using the method above is formals being in the body of a function coming through as symbols. They start as symbol_formals in getParseData() but then become symbol when called later. globals::findGlobals() is good for this use case but it doesn't catch that agg was made in step 2

exprs <- parse(text = file_lines)
map(exprs, globals::findGlobals)
# [[1]]
# [1] "sapply" "mtcars" "min"   
# 
# [[2]]
# [1] "<-"     "{"      "switch" "mean"   "sum"   
# 
# [[3]]
# [1] "agg" ":" 

Another gotcha is an assignment like 1:3 -> min is hard to follow in getParseData() but when treated as an expression, it puts it in the right order

parse(text = "1:3 -> min") |> as.character()
# [1] "min <- 1:3"

For this you could do something weird like below to standardize the call then use getParseData(includeText = TRUE) This argument will mark the expression call as parent = 0 when they occur in the global environment. You can then use cumsum() to find each new time that occurs.

parse(text = "1 -> min; a = 23") |>
  as.character() |> 
  parse(text = _) |> 
  getParseData(includeText = TRUE) |> 
  mutate(expr = cumsum(parent == 0))

I liked the FUN or .f idea but there could be cases where more than one argument is passed (like passing fct_reorder that then uses sum) and there are other ways it gets called .funs .fun, etc.

There is a lot of info here re: Abstract Syntax Trees (AST).

Some of these use cases are pretty niche maybe there is some low-hanging fruit in the mix? I'll try to take a closer look later.

rjake avatar Aug 02 '22 03:08 rjake

This is great, thanks for putting together. A few questions / comments:


Do you need the whole exprs |> as.character |> parse ... or would it just work using utils::getParseData() directly. I.e.

file_lines <- "
  sapply(mtcars, min)
  min(1:10) -> a
  b = base::min(1:10)
  d <- plot <- print <- NULL
  agg <- function(x, fun) {
    fn <- 
      switch(
        fun, 
        avg = mean,
        total = sum
      )
    
    fn(x, na.rm = TRUE)
  }

  agg(1:3, 'total')
"

file_output <- tempfile(fileext = ".R")
writeLines(file_lines, file_output)
parsed_df <- utils::getParseData(parse(file_output, keep.source = TRUE)

Currently in funspotr I just do the latter and don't see why you'd need the extra set-up in this case even for your later filtering steps...?


I like your use of get0(.x) |> class() though don't see why you'd need toString(), does class() not always return a string?


One thing this makes me think more about is how I want to handle functions that aren't installed locally. Currently if you have madeupfun(x) in your script "madeupfun" will be returned in the funspotr output as a function and then the package it comes from will just show-up as "(unknown)" -- in the case of apply(x, madeupfun) though "madeupfun" would almost certainly have to be dropped (as no easy way to differentiate it from x). This asynchrony is probably OK (a separate issue that should open, but am wondering if should set a warning in the case of "(unknown)" functions or an option to drop these from output).


(Also should consider how much computational overhead get0() adds... though probably not a ton compared to issues with my current method for loading/attaching packages for each file #14 .)

brshallo avatar Aug 11 '22 19:08 brshallo

Do you need the whole exprs |> as.character |> parse ... or would it just work using

Even though it is right-assigned min(1:10) -> a, R interprets it as a <- min(1:10) and using as.character() |> parse() will put it in the right order when I then pass it to getParseData(). I'm not really sure how to interact with the id & parent columns. Maybe there is a simpler way to do this.


I like your use of get0(.x) |> class() though don't see why you'd need toString(), does class() not always return a string?

Ah, you're right, I don't need it for class(). I was trying to deal with null values in find() though and it looks like ~find(.x)[1] works just as well.

rjake avatar Aug 11 '22 20:08 rjake

@rjake did not get to this before pushing to CRAN.

Given the ambiguity here, one approach I was considering was making it so you could have an argument that allowed you to specify an approach. So say you want to do something like your approach above, you could pass this in as a function into a parser argument (or something like that). funspotr's value-add in this situation would be setting-up the search space but then it gives the user flexibility to identify the symbols to check against the search space. Perhaps funspotr would have some in-built approaches available but it also keeps me from actually having to set this up (or worrying about the ambiguities or partial solutions that may be "good enough" for someone's use-case).

Actually setting this up though would require me to untangle a few steps in the current approach (that I may or may not get to).

brshallo avatar Oct 31 '23 06:10 brshallo