tibble Name repair after column assignment

Name repair after column assignment

Open krlmlr opened this issue 3 years ago • 11 comments

One of tibble's invariants is that column subassignment never touches columns outside j. This leads to the following behavior:

library(tibble)

# Existing behavior, with test
df <- tibble(x = 1:2)
df[2] <- tibble(x = 3:4)
df
#> # A tibble: 2 × 2
#>       x     x
#>   <int> <int>
#> 1     1     3
#> 2     2     4

# New behavior
df[3:5] <- list(y = 5:6)
df
#> # A tibble: 2 × 5
#>       x     x y...3 y...4 y...5
#>   <int> <int> <int> <int> <int>
#> 1     1     3     5     5     5
#> 2     2     4     6     6     6

^{Created on 2021-07-31 by the reprex package (v2.0.0.9000)}

I added the second example only today, it still satisfies the invariant though. (Previously, all new columns were named y here.)

I think we discussed this before. I'm raising this again because the first behavior cannot be (easily) implemented with the current definition of dplyr_col_modify(). Moreover, I still think the first behavior is correct, unless we allow name repair to propagate over columns outside j.

tibble achieves this by implementing tbl_subassign_col(x, j, value); here, j is a named integer vector, and value is an unnamed list. If tibble were to use the current dplyr_col_modify() approach, tbl_subassign_col() would become somewhat slower due to the repeated matching of column names (as opposed to fast index-based addressing).

Should we:

change tibble's invariants in the first example
enhance dplyr_col_modify() to accept and use an attribute attached to value that indicates the target index
work around by temporarily repairing names in the first example, to use dplyr_col_modify()
implement df_col_modify(x, j, value, ...) instead, along with df_reconstruct() and df_row_slice()

CC @hadley @jennybc @DavisVaughan.

Jul 31 '21 07:07 krlmlr

Part of me thinks the first example shouldn't work or, rather, should require some explicit declaration of .name_repair = "minimal", since that would be necessary to create this tibble "all at once" vs. incrementally.

tibble::tibble(
  x = 1:2,
  x = 3:4,
  .name_repair = "minimal"
)
#> # A tibble: 2 x 2
#>       x     x
#>   <int> <int>
#> 1     1     3
#> 2     2     4

Is it useful to articulate the the default name repair strategy for column assignment? I guess it's "minimal"?

In general, it feels weird to have an object that is easy (or possible) to construct one way but very hard (or impossible) to construct another way. However, mostly I've been thinking about such things in the context of tibble subclasses (i.e. the lead up to #890).

I'm afraid I don't understand the innards and implications enough to have a strong opinion on your more detailed questions.

Jul 31 '21 15:07 jennybc

Thanks. I guess the question now simplifies to:

Should we maintain tibble's invariant that columns outside j are off limits?
If the default name repair strategy is "minimal", even for new column names, this means that the second example should have all new columns named "y"?

Sorry for the noise -- by now I have found that it is still possible to create such a tibble incrementally, even using dplyr_col_modify():

out <- dplyr::dplyr_col_modify(data.frame(x = 1), list(2, 3))
names(out) <- c("x", "x", "x")
out
#>   x x x
#> 1 1 2 3

# This expands to:
out <- list(x = 1)
out[[""]] <- 2
out[[""]] <- 3
names(out) <- c("x", "x", "x")
vctrs::new_data_frame(out)
#>   x x x
#> 1 1 2 3

^{Created on 2021-07-31 by the reprex package (v2.0.0.9000)}

I don't understand "one way" and "(hard/impossible to construct) another way".

Jul 31 '21 16:07 krlmlr

I don't understand "one way" and "(hard/impossible to construct) another way".

library(tibble)

# incremental construction makes duplicate names easy / the default
df <- tibble(x = 1:2)
df[2] <- tibble(x = 3:4)
df
#> # A tibble: 2 x 2
#>       x     x
#>   <int> <int>
#> 1     1     3
#> 2     2     4

# actually, it depends on how you do incremental construction
foo <- tibble(x = 1:2)
foo <- add_column(foo, x = 3:4)
#> Error: Column name `x` must not be duplicated.
#> Use .name_repair to specify repair.
foo <- add_column(foo, x = 3:4, .name_repair = "minimal")
foo
#> # A tibble: 2 x 2
#>       x     x
#>   <int> <int>
#> 1     1     3
#> 2     2     4

# "all at once" construction makes duplicate names hard to achieve
dat <- tibble::tibble(x = 1:2, x = 3:4)
#> Error: Column name `x` must not be duplicated.
#> Use .name_repair to specify repair.
dat <- tibble::tibble(x = 1:2, x = 3:4, .name_repair = "minimal")
dat
#> # A tibble: 2 x 2
#>       x     x
#>   <int> <int>
#> 1     1     3
#> 2     2     4

wut <- as_tibble(list(x = 1:2, x = 3:4))
#> Error: Column name `x` must not be duplicated.
#> Use .name_repair to specify repair.
wut <- as_tibble(list(x = 1:2, x = 3:4), .name_repair = "minimal")
wut
#> # A tibble: 2 x 2
#>       x     x
#>   <int> <int>
#> 1     1     3
#> 2     2     4

It feels like tibble tends overwhelmingly towards .name_repair = "unique", i.e. if you want something else you have to explicitly say so. In the context of all those examples, the behaviour of df[2] <- tibble(x = 3:4) feels "out of character".

Jul 31 '21 16:07 jennybc

I hear you now. I suspect your answer to the second question is that this behavior is fine.

To me it looks like [ is a lower-level interface. Currently, name repair in [ occurs only when creating a new unnamed column, or (as of recently) when duplicating a newly-created named column by spreading the same column over multiple target columns. This uses "unique" name repair.

Should we maybe do this:

library(tibble)

# Existing behavior, with test
df <- tibble(x = 1:2)
df[2] <- tibble(x = 3:4)
df
#> # A tibble: 2 × 2
#>       x x...2
#>   <int> <int>
#> 1     1     3
#> 2     2     4

This would be a hybrid solution that is somewhat consistent with the rest of tibble and also keeps columns outside j unchanged. Would that be a viable approach?

This aspect was intentionally left unspecified in the invariants, albeit only for subsetting:

When subsetting repeated indexes, the resulting column names are undefined, do not rely on them.

We could apply "unique" repair for subsetting too.

Jul 31 '21 17:07 krlmlr

I am not sure comparing tibble sub assignment behavior directly against dplyr_col_modify() is the right way to go here. dplyr_col_modify() is mainly used for mutate/summarise which always overwrite an existing column named x if you do mutate(x = stuff). So there is no need to consider name repair strategies there.

Aug 02 '21 15:08 DavisVaughan

library(tibble)

# Existing behavior, with test
df <- tibble(x = 1:2)
df[2] <- tibble(x = 3:4)
df
#> # A tibble: 2 × 2
#>       x x...2
#>   <int> <int>
#> 1     1     3
#> 2     2     4

I'm not sure if name repair on just the added column makes sense. I think we've learned that, to hit a specific level of name repair, it works best if you are operating on all names. Pathological example: what if the initial column had been named x...2?

Aug 02 '21 15:08 jennybc

My first instinct is to agree with @jennybc's thoughts that some of these examples shouldn't work without explicit setting of .name_repair. I could imagine [<- using "check_unique" (like much of the rest of tibble), but also exposing tbl_col_assign() (or maybe a more general tbl_assign()) with an explicit .name_repair argument.

So:

df <- tibble(x = 1:2)

# error because you'd have duplicate names
# probably desired behavior for interactive usage
df[2] <- tibble(x = 3:4)

# explicit request for unique name repair
# would result in `c("x...1", "x...2")`
# obviously this touches columns outside `j`, but it feels ok since the user explicitly requested the name repair?
# might be best for programmatic usage
tbl_col_assign(df, 2, tibble(x = 3:4), .name_repair = "unique")

I have no idea if switching to "check_unique" would be too strict or not.

Aug 02 '21 15:08 DavisVaughan

Agreed: if we repair, we repair on all columns. In this case add_column() would be different from [<-().

I like the check_unique idea, we could run revdepchecks. For exotic use cases, we have add_column() that exposes name repair.

I also don't mind if the second example in the original post fails if a column named x...3 exists.

How does that sound?

Aug 02 '21 17:08 krlmlr

I think this last proposal by @krlmlr sounds pretty interesting and would be interesting to see revdeps for.

Aug 02 '21 19:08 jennybc

I reverted #929. I'll turn the following inta failure, with separate revdepchecks:

library(tibble)

# Existing behavior, with test
df <- tibble(x = 1:2)
df[2] <- tibble(x = 3:4)
df
#> # A tibble: 2 × 2
#>       x     x
#>   <int> <int>
#> 1     1     3
#> 2     2     4

# Duplication
df[3:5] <- list(y = 5:6)
df
#> # A tibble: 2 × 5
#>       x     x     y     y     y
#>   <int> <int> <int> <int> <int>
#> 1     1     3     5     5     5
#> 2     2     4     6     6     6

# Subsetting
df[c(1, 1)]
#> # A tibble: 2 × 2
#>       x     x
#>   <int> <int>
#> 1     1     1
#> 2     2     2

^{Created on 2021-08-07 by the reprex package (v2.0.0.9000)}

Aug 07 '21 13:08 krlmlr

Reverted in 7ebc39412381d80cb7b7c58a8a84b9bed3ad1c5f.

Oct 24 '21 02:10 krlmlr

Do we now agree that [ should never do name repair for tibbles?

CC @hadley.

Feb 23 '23 04:02 krlmlr

I have only re-skimmed this conversation. But if I follow, yes, it seems reasonable to me that [<-() does not cause any name repair. It feels like an explicit move and, if you do it, you take responsibility for the resulting object.

Feb 23 '23 16:02 jennybc

Not repairing names after column assignment.

Mar 04 '23 04:03 krlmlr

tibble tibble copied to clipboard

Name repair after column assignment

tibble
tibble copied to clipboard