tibbletime icon indicating copy to clipboard operation
tibbletime copied to clipboard

collapse_index on a broken index gives difficult-to-debug errors

Open md0u80c9 opened this issue 7 years ago • 13 comments

Hi,

This is really feedback from an 'I shot myself in the foot and spent ages working out why/how' thing. I don't have a reprex per se - I suspect one could be made if I think about it for long enough...Apologies that I haven't had time to explore the issue further (yet).

I managed to produce a broken tibbletime index. The way I think I did this was by left joining two tibbletime tables. Because the tables started out with the same name, I renamed the index on the LHS before joining (so far so good - so our intended index is on the LHS with a new name, and the old index on the RHS has the original name and is still present but now may contain NULLs).

However I then tried to call dplyr::mutate with collapse_index on the old index name to make the collapsed index field by mistake rather than the new name.

The error you get is: Error in mutate_impl(.data, dots) : Evaluation error: missing value where TRUE/FALSE needed.

Clearly that is a long way away from where the error truly lies (ie. using an invalid index).

My thoughts:

  • Is this an issue with joining two tibbletime tables with valid indices? Is the index on the RHS invalidated (ie. no longer recognised as a tibbletime index) as it should?
  • Is this possible with a non-indexed date/time field containing NULLs? Would that result in the same error? Or would collapse_index spot that we're using an invalid index in those circumstances?
  • If collapse_index is able to work on any columns (which I doubt), does collapse_index need to check for the presence of NULLs before working, and give a more descriptive error?

Hope this makes sense. If I get the time I'll try to make a reprex for you of this is in action if it doesn't make sense from the above!

md0u80c9 avatar Mar 25 '18 10:03 md0u80c9

I think there are two problems here.

library(tibbletime)
library(dplyr)

data(FB)

FB_time <- as_tbl_time(FB, date)

head(FB_time)
#> # A time tibble: 6 x 8
#> # Index: date
#>   symbol date        open  high   low close     volume adjusted
#>   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>    <dbl>
#> 1 FB     2013-01-02  27.4  28.2  27.4  28.0  69846400.     28.0
#> 2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600.     27.8
#> 3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400.     28.8
#> 4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800.     29.4
#> 5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300.     29.1
#> 6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700.     30.6

FB_time2 <- rename(FB_time, date2 = date)

# Notice how "date" is the index but "date2" is the column name
FB_time2
#> # A time tibble: 1,008 x 8
#> # Index: date
#>    symbol date2       open  high   low close     volume adjusted
#>    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>    <dbl>
#>  1 FB     2013-01-02  27.4  28.2  27.4  28.0  69846400.     28.0
#>  2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600.     27.8
#>  3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400.     28.8
#>  4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800.     29.4
#>  5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300.     29.1
#>  6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700.     30.6
#>  7 FB     2013-01-10  30.6  31.5  30.3  31.3  95316400.     31.3
#>  8 FB     2013-01-11  31.3  32.0  31.1  31.7  89598000.     31.7
#>  9 FB     2013-01-14  32.1  32.2  30.6  31.0  98892800.     31.0
#> 10 FB     2013-01-15  30.6  31.7  29.9  30.1 173242600.     30.1
#> # ... with 998 more rows

# Can't find the "date" index
collapse_by(FB_time2)
#> Error in mutate_impl(.data, dots): Evaluation error: cannot coerce type 'closure' to vector of type 'double'.

# This is your problem. NA values mess up the collapse_by()
FB_time$date[1] <- NA
collapse_by(FB_time)
#> Error in mutate_impl(.data, dots): Evaluation error: missing value where TRUE/FALSE needed.

Created on 2018-03-26 by the reprex package (v0.2.0).

DavisVaughan avatar Mar 26 '18 12:03 DavisVaughan

For the first problem of changing the name of the index, I think that every function that could run into this problem calls get_index_quo() at some point. I think it would make sense to let get_index_quo() assert that the index is still a valid column name, and error out otherwise.

DavisVaughan avatar Mar 26 '18 12:03 DavisVaughan

Additionally, I am going to support rename() as a dplyr verb. This means that if the user does date2=date like above, the resulting object is no longer a tbl_time, but is returned as a tibble because the index column (date) has been "removed".

DavisVaughan avatar Mar 26 '18 12:03 DavisVaughan

Is renaming the index column name itself bad? If we can detect it is a rename, would it be possible to honour the renamed index variable instead?

md0u80c9 avatar Mar 26 '18 13:03 md0u80c9

The only way to possibly detect it is if it came from a dplyr::rename(). Trying to capture a manual rename from a user would be pretty hard I think, and introduce a lot of randomness i might not be able to control. I would rather be consistent across the board rather than just whitelist dplyr::rename() as I think even that might be difficult.

DavisVaughan avatar Mar 26 '18 13:03 DavisVaughan

That’s reasonable. Is it worth /possible to warn the user that performing dplyr rename on a tbl_time outputs a tibble not a tbl_time?

Doing an explicit as_tibble forced downcast would be a lot better than coming back to the code later and trying to guess why the index doesn’t work!

md0u80c9 avatar Mar 26 '18 13:03 md0u80c9

I can see the value in both, and I kind of changed my mind after a bit of thinking. I added a branch that implements rename() with support for renaming the index column. It required a bit of tidyeval magic but I think it works nicely. Can you see if it works for you?

devtools::install_github("business-science/tibbletime", ref = "fix/valid-index-quo")

DavisVaughan avatar Mar 26 '18 13:03 DavisVaughan

If it seems like it works nicely, I'll use it. If not, I'll revert back to returning a tibble. I can include a warning, but the way it currently works is to use the same system as the case where the user does select(FB_tbl_time, open). That returns a tibble too, not a tbl_time. That does not currently output a warning. Should it? It seems like the user knows what they are doing.

DavisVaughan avatar Mar 26 '18 13:03 DavisVaughan

Hmm it might be worth warning if the output will be downcast just to avoid difficult-to-spot errors later.

If you are re-reading the code later seeing code which reads:

Dplyr::select(as_tibble(myTblTime) , foo, bar) gives you a hint you’re about to just get a tibble back,

Whereas:

Dplyr::select(myTblTime, foo, bar)

Doesn’t intuitively tell you that you’re about to lose time indexing.

Just a “warning - dplyr::select will downcast ‘myTblTime’ to a tibble. If you intend to do this, force downcast myTblTime using as_tibble” would suffice as a reminder.

md0u80c9 avatar Mar 26 '18 14:03 md0u80c9

(Only saying this having had a really fun session debugging my code on Sunday as you can guess!)

Dplyr code feels really obvious at the time. But things like tidyeval can make an innocuous-looking dplyr statement very tricky indeed, so might be worth just enforcing good practice.

md0u80c9 avatar Mar 26 '18 14:03 md0u80c9

One thing i worry about is being too noisy, especially when programming with this. I do the select() thing all the time inside package functions, knowing that it will drop the tbl_time class, but i dont want the user to see that because i am just doing it for some intermediate calculation. I could obviously opt for some kind of verbose = TRUE argument in select.tbl_time() but that feels clumsy given how clean the dplyr interface is

DavisVaughan avatar Mar 26 '18 14:03 DavisVaughan

I think I'd compare this to doing a dplyr::join where if you fail to specify all the join parameters - you get a message which tells you what else has been used to make the join.

I think if you can downcast the object type to a tibble before using with dplyr this is a sensible way to omit the message whilst acknowledging that you know you're only going to get a tibble and not a tbl_time.

I wouldn't be keen on the verbose = TRUE because it doesn't cleanly indicate what it's hiding and I agree it would be clumsy. as_tibble(tbl_time) for the input feels cleaner and makes object type in = object type out, which reads more logically.

Perhaps the message should be cleaner: 'dplyr::select downcasts the tbl_time to a tibble. See X for more information' where X is a help page which advises which operations would take in a tbl_time and output a tibble, and advise the coding acknowledgement.

md0u80c9 avatar Mar 26 '18 21:03 md0u80c9

Tested with the new version this evening. We get a warning that the index has gone. I have then just reapplied a new index to the renamed column. All looks good so far :-).

md0u80c9 avatar Mar 27 '18 19:03 md0u80c9