Consider using disk.frame as a backend to dbplyr
disk.frame seems a natural fit as an alternative backend to dbplyr. Have you considered using the infrastructure you've built in this way? disk.frame would act as the database-access package, and all the front-end interaction would go through dplyr.
The interaction for the user would be substantially the same but the benefits would be:
- Wider adoption
- Support from tidyverse developers
- A tried-and-true, standardised interface
https://db.rstudio.com/dplyr/
Thanks for bringing this to my attention. Are you saying that we should set up disk.frame like a remote server database? That could be interesting.
I probably need time to digest this. Some quick thoughts:
- The attraction of disk.frame for me is the minimal setup required, which is `install.packages("disk.frame")`. So setting up a remote database might run counter to that. But it's worth exploring, because `disk.frame` allows you to run native R functions. A managed disk.frame service on the cloud could be a valid business idea.
- `dbplyr` is used to connect to databases and exposes `dplyr` verbs for easy manipulation. For this aspect, `disk.frame` already exposes `dplyr` verbs, so for the use case where no remote server is involved, it is pretty similar.
- I want to introduce a `disk.frame.db` package where `group_by` is done in one step, not over two stages. So unless a remote server is involved, there is minimal advantage for `disk.frame` as a dbplyr backend.
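For context, a sketch of the existing (non-dbplyr) interface and the two-stage `group_by` mentioned above. This assumes the `chunk_group_by`/`chunk_summarise` verb names from disk.frame's chunked API; the exact names have varied across releases:

```r
library(disk.frame)
library(dplyr)
setup_disk.frame(workers = 2)  # set up the parallel backend

# convert an in-memory data.frame to a chunked disk.frame
flights.df <- as.disk.frame(nycflights13::flights)

flights.df %>%
  chunk_group_by(carrier) %>%   # stage 1: aggregate within each chunk
  chunk_summarise(n = n()) %>%
  collect() %>%                 # bring per-chunk results into memory
  group_by(carrier) %>%         # stage 2: combine the chunk-level results
  summarise(n = sum(n))
```

A `disk.frame.db` package would collapse those two stages into a single `group_by`/`summarise`, matching what users expect from plain dplyr.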
The database wouldn't necessarily need to be remote to work with dbplyr, I believe. The advantage would be that you wouldn't have to maintain the dplyr interface.
> you wouldn't have to maintain the dplyr interface
I think you do, actually. This is because dbplyr translates the dplyr verbs into SQL (or whichever query language the DB uses). This translation step doesn't happen automatically; it has to be written by hand and customised for each DB. So there would need to be a customised translation for disk.frame as well. Unless my understanding of dbplyr is wrong.
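The per-backend nature of this translation is visible via `dbplyr::translate_sql()`, which renders the same R expression against different simulated connections (a sketch; `simulate_postgres()` and `simulate_sqlite()` are dbplyr's mock connections):

```r
library(dbplyr)

# The same R expression is translated by backend-specific rules,
# which is why each DB needs its own hand-written translations.
translate_sql(substr(x, 1, 2), con = simulate_postgres())
translate_sql(substr(x, 1, 2), con = simulate_sqlite())
```

A disk.frame backend would need an analogous set of rules (or, since disk.frame chunks are plain data.frames, it could evaluate the verbs directly, as discussed below).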
Unless it's remote, the attraction isn't there, because I want setting up a disk.frame to be of minimum hassle. A managed disk.frame.db would be really nice for the occasionally large data workload, or used as part of a pipeline.
I also don't have much understanding of dbplyr, but this page: https://db.rstudio.com/best-practices/select-interface/ seems to suggest that the translation happens via a "consistent set of functions that work across all connections" provided by the DBI package.
From what I can see, DBI can send SQL statements as strings to databases.

It doesn't use dplyr, so I deduce that dbplyr must translate the dplyr verbs into those SQL strings, as seems to be the case in the post.
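For example (a minimal sketch using the in-memory SQLite driver from RSQLite), DBI's interface is indeed SQL strings:

```r
library(DBI)

# DBI works with SQL strings, not dplyr verbs
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con)
```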

Currently, disk.frame doesn't support SQL (and I don't plan to add it), so there would be some minor maintenance work to "translate" the dplyr verbs. The great thing is that disk.frame just uses data.frames, so no translation is necessary! All I need to do is faithfully execute the dplyr verbs as defined! The non-dbplyr version of disk.frame can already do this.
So I think a disk.frame for dbplyr makes the most sense as a remote database.
While I think a disk.frame db would be great, I'm curious how it would be different from setting up a SQLite database (which is fairly straightforward and has very few dependencies). In particular, I think the latter option would probably have much higher peak performance due to indexing.
> I'm curious how it would be different from setting up a SQLite database
With disk.frame you can apply native R functions efficiently, but SQLite is more limited in that regard. But you could test the speed? Or show me a dataset and a set of operations in SQLite, and I can compare the speed.
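As a sketch of what "native R functions" means here (the `trim_label` helper is a hypothetical user-defined function for illustration):

```r
library(disk.frame)
library(dplyr)
setup_disk.frame()

df <- as.disk.frame(iris)

# An arbitrary user-defined R function -- awkward to express in SQLite's SQL
trim_label <- function(x) toupper(substr(as.character(x), 1, 3))

df %>%
  mutate(species_code = trim_label(Species)) %>%  # runs chunk-wise, in R
  collect() %>%
  head()
```

Because each chunk is an ordinary data.frame, any R function that operates row-wise works unchanged; in SQLite the same logic would need to be rewritten in SQL or registered as a custom function.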
I actually agree that disk.frame as a dbplyr backend makes sense and would be about as performant as possible when using custom R functions (SQLite is definitely more limited there, and slightly more of a headache to manage)! For peak performance, SQLite might still win, so there's a tradeoff for folks to consider (similar to data.table vs dplyr!).