tibbletime
tibbletime copied to clipboard
collapse_index on a broken index gives difficult-to-debug errors
Hi,
This is really feedback from an 'I shot myself in the foot and spent ages working out why/how' thing. I don't have a reprex per se - I suspect one could be made if I think about it for long enough...Apologies that I haven't had time to explore the issue further (yet).
I managed to produce a broken tibbletime index. The way I think I did this was by left joining two tibbletime tables. Because the tables started out with the same name, I renamed the index on the LHS before joining (so far so good - so our intended index is on the LHS with a new name, and the old index on the RHS has the original name and is still present but now may contain NULLs).
However I then tried to call dplyr::mutate with collapse_index on the old index name to make the collapsed index field by mistake rather than the new name.
The error you get is: Error in mutate_impl(.data, dots) : Evaluation error: missing value where TRUE/FALSE needed.
Clearly that is a long way away from where the error truly lies (ie. using an invalid index).
My thoughts:
- Is this an issue with joining two tibbletime tables with valid indices? Is the index on the RHS invalidated (ie. no longer recognised as a tibbletime index) as it should?
- Is this possible with a non-indexed date/time field containing NULLs? Would that result in the same error? Or would collapse_index spot that we're using an invalid index in those circumstances?
- If collapse_index is able to work on any columns (which I doubt), does collapse_index need to check for the presence of NULLs before working, and give a more descriptive error?
Hope this makes sense. If I get the time I'll try to make a reprex for you of this is in action if it doesn't make sense from the above!
I think there are two problems here.
library(tibbletime)
library(dplyr)
data(FB)
FB_time <- as_tbl_time(FB, date)
head(FB_time)
#> # A time tibble: 6 x 8
#> # Index: date
#> symbol date open high low close volume adjusted
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 FB 2013-01-02 27.4 28.2 27.4 28.0 69846400. 28.0
#> 2 FB 2013-01-03 27.9 28.5 27.6 27.8 63140600. 27.8
#> 3 FB 2013-01-04 28.0 28.9 27.8 28.8 72715400. 28.8
#> 4 FB 2013-01-07 28.7 29.8 28.6 29.4 83781800. 29.4
#> 5 FB 2013-01-08 29.5 29.6 28.9 29.1 45871300. 29.1
#> 6 FB 2013-01-09 29.7 30.6 29.5 30.6 104787700. 30.6
FB_time2 <- rename(FB_time, date2 = date)
# Notice how "date" is the index but "date2" is the column name
FB_time2
#> # A time tibble: 1,008 x 8
#> # Index: date
#> symbol date2 open high low close volume adjusted
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 FB 2013-01-02 27.4 28.2 27.4 28.0 69846400. 28.0
#> 2 FB 2013-01-03 27.9 28.5 27.6 27.8 63140600. 27.8
#> 3 FB 2013-01-04 28.0 28.9 27.8 28.8 72715400. 28.8
#> 4 FB 2013-01-07 28.7 29.8 28.6 29.4 83781800. 29.4
#> 5 FB 2013-01-08 29.5 29.6 28.9 29.1 45871300. 29.1
#> 6 FB 2013-01-09 29.7 30.6 29.5 30.6 104787700. 30.6
#> 7 FB 2013-01-10 30.6 31.5 30.3 31.3 95316400. 31.3
#> 8 FB 2013-01-11 31.3 32.0 31.1 31.7 89598000. 31.7
#> 9 FB 2013-01-14 32.1 32.2 30.6 31.0 98892800. 31.0
#> 10 FB 2013-01-15 30.6 31.7 29.9 30.1 173242600. 30.1
#> # ... with 998 more rows
# Can't find the "date" index
collapse_by(FB_time2)
#> Error in mutate_impl(.data, dots): Evaluation error: cannot coerce type 'closure' to vector of type 'double'.
# This is your problem. NA values mess up the collapse_by()
FB_time$date[1] <- NA
collapse_by(FB_time)
#> Error in mutate_impl(.data, dots): Evaluation error: missing value where TRUE/FALSE needed.
Created on 2018-03-26 by the reprex package (v0.2.0).
For the first problem of changing the name of the index, I think that every function that could run into this problem calls get_index_quo() at some point. I think it would make sense to let get_index_quo() assert that the index is still a valid column name, and error out otherwise.
Additionally, I am going to support rename() as a dplyr verb. This means that if the user does date2=date like above, the resulting object is no longer a tbl_time, but is returned as a tibble because the index column (date) has been "removed".
Is renaming the index column name itself bad? If we can detect it is a rename, would it be possible to honour the renamed index variable instead?
The only way to possibly detect it is if it came from a dplyr::rename(). Trying to capture a manual rename from a user would be pretty hard I think, and introduce a lot of randomness i might not be able to control. I would rather be consistent across the board rather than just whitelist dplyr::rename() as I think even that might be difficult.
That’s reasonable. Is it worth /possible to warn the user that performing dplyr rename on a tbl_time outputs a tibble not a tbl_time?
Doing an explicit as_tibble forced downcast would be a lot better than coming back to the code later and trying to guess why the index doesn’t work!
I can see the value in both, and I kind of changed my mind after a bit of thinking. I added a branch that implements rename() with support for renaming the index column. It required a bit of tidyeval magic but I think it works nicely. Can you see if it works for you?
devtools::install_github("business-science/tibbletime", ref = "fix/valid-index-quo")
If it seems like it works nicely, I'll use it. If not, I'll revert back to returning a tibble. I can include a warning, but the way it currently works is to use the same system as the case where the user does select(FB_tbl_time, open). That returns a tibble too, not a tbl_time. That does not currently output a warning. Should it? It seems like the user knows what they are doing.
Hmm it might be worth warning if the output will be downcast just to avoid difficult-to-spot errors later.
If you are re-reading the code later seeing code which reads:
Dplyr::select(as_tibble(myTblTime) , foo, bar) gives you a hint you’re about to just get a tibble back,
Whereas:
Dplyr::select(myTblTime, foo, bar)
Doesn’t intuitively tell you that you’re about to lose time indexing.
Just a “warning - dplyr::select will downcast ‘myTblTime’ to a tibble. If you intend to do this, force downcast myTblTime using as_tibble” would suffice as a reminder.
(Only saying this having had a really fun session debugging my code on Sunday as you can guess!)
Dplyr code feels really obvious at the time. But things like tidyeval can make an innocuous-looking dplyr statement very tricky indeed, so might be worth just enforcing good practice.
One thing i worry about is being too noisy, especially when programming with this. I do the select() thing all the time inside package functions, knowing that it will drop the tbl_time class, but i dont want the user to see that because i am just doing it for some intermediate calculation. I could obviously opt for some kind of verbose = TRUE argument in select.tbl_time() but that feels clumsy given how clean the dplyr interface is
I think I'd compare this to doing a dplyr::join where if you fail to specify all the join parameters - you get a message which tells you what else has been used to make the join.
I think if you can downcast the object type to a tibble before using with dplyr this is a sensible way to omit the message whilst acknowledging that you know you're only going to get a tibble and not a tbl_time.
I wouldn't be keen on the verbose = TRUE because it doesn't cleanly indicate what it's hiding and I agree it would be clumsy. as_tibble(tbl_time) for the input feels cleaner and makes object type in = object type out, which reads more logically.
Perhaps the message should be cleaner: 'dplyr::select downcasts the tbl_time to a tibble. See X for more information' where X is a help page which advises which operations would take in a tbl_time and output a tibble, and advise the coding acknowledgement.
Tested with the new version this evening. We get a warning that the index has gone. I have then just reapplied a new index to the renamed column. All looks good so far :-).