Should `*_wider()` have option to list new columns?

Open hadley opened this issue 3 years ago • 3 comments

I don't love that when you read code like this:

locations <- gmaps_cities |> 
  unnest_wider(json) |> 
  select(-status) |> 
  unnest_longer(results) |> 
  unnest_wider(results)

You have no idea what columns you'll end up with. And if the data structure changes, this code will continue to work but give you new column names, which can cause a potentially confusing downstream failure.

It might be nice if you could specify the column names you expect:

locations <- gmaps_cities |> 
  unnest_wider(json, new = c("results", "status")) |> 
  select(-status) |> 
  unnest_longer(results) |> 
  unnest_wider(results, new = c("city", "address_components", "formatted_address", "geometry", "place_id", "types"))

OTOH, it seems unlikely that people would actually do this without some helper that would automatically add it to their code, and we'd need to carefully think through the semantics of these columns (i.e. it feels like it shouldn't error if there were additional columns not included in the list?)
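
For anyone who wants this behaviour today, here is a minimal sketch of a wrapper that checks the new column names after the fact. `unnest_wider_expecting()` and its `new` and `allow_extra` arguments are hypothetical and not part of tidyr; `allow_extra` just makes the open question about additional columns explicit. It detects new columns by comparing names before and after, so it would miss a new column that reuses an existing name.

```r
library(tidyr)
library(tibble)

# Hypothetical helper, not part of tidyr: unnest `col`, then compare the
# columns that appeared against the names the caller expects.
unnest_wider_expecting <- function(data, col, new, allow_extra = TRUE) {
  out <- unnest_wider(data, {{ col }})

  # Columns present after unnesting that were not there before.
  created <- setdiff(names(out), names(data))

  missing <- setdiff(new, created)
  if (length(missing) > 0) {
    stop("Expected columns not found: ", paste(missing, collapse = ", "))
  }

  extra <- setdiff(created, new)
  if (!allow_extra && length(extra) > 0) {
    stop("Unexpected new columns: ", paste(extra, collapse = ", "))
  }

  out
}

# Toy data standing in for a parsed JSON response.
df <- tibble(
  id = 1:2,
  json = list(
    list(results = "a", status = "OK"),
    list(results = "b", status = "OK")
  )
)

ok <- unnest_wider_expecting(df, json, new = c("results", "status"))

# With allow_extra = FALSE, an unlisted column becomes an error.
strict_result <- tryCatch(
  unnest_wider_expecting(df, json, new = "results", allow_extra = FALSE),
  error = function(e) conditionMessage(e)
)
```

A missing expected column always errors here; only the handling of extra columns is switchable, since that is the part the semantics are unclear on.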

hadley, Jun 09 '22 13:06

OTOOH, this would be nice for dbplyr, where we could avoid a query if we knew the column names.

hadley, Oct 19 '22 21:10

This does feel somewhat reasonable for programmatic type stability purposes, i.e. it probably wouldn't be used interactively, but I could see people hardcoding expected results in a function that calls unnest_wider()?

> it feels like it shouldn't error if there were additional columns not included in the list?

I think I disagree. If I've opted in to taking the time to specify this argument, then I'd want the output columns to be exactly what I specified there, for programmatic purposes (which include the ncol() of the resulting data frame).

It would also probably be more useful for dbplyr if it errored? (I think it is the same "programmatic stability" argument)

DavisVaughan, Oct 31 '22 13:10

I read this because I searched for gmaps_cities so I could remember why it's in repurrrsive, which I'm about to release.

From the outside, it feels like maybe you should be able to use the existing ptype argument for this, with a special type specification that means "whatever, just make sure there's a column with this name". With my readr/readxl/googlesheets4 hat on, it would be like specifying a column by name, but with a guessed type.
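
For reference, a small sketch of what the existing `ptype` argument does today, on made-up data: it takes a named list of prototypes keyed by the new column names and coerces those columns, while unlisted columns are still guessed. The "any type, just require this name" specification described above does not exist as far as I know, so it is not shown.

```r
library(tidyr)
library(tibble)

# Toy list-column; each element is a named list.
df <- tibble(
  json = list(
    list(a = 1, b = "x"),
    list(a = 2, b = "y")
  )
)

# `a` would be guessed as double; the ptype asks for integer instead.
# `b` is not listed, so its type is still guessed.
typed <- unnest_wider(df, json, ptype = list(a = integer()))
```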

Also the matter of additional columns feels related to cols() vs cols_only().

jennybc, Dec 17 '22 02:12