
consider using disk.frame as a backend to dbplyr

Open kendonB opened this issue 6 years ago • 8 comments

disk.frame seems like a natural fit as an alternative backend for dbplyr. Have you considered using the infrastructure you've set up in this way? The disk.frame package would then act like the database-access package, and all the front-end interaction would go through dplyr.

The interaction for the user would be substantially the same but the benefits would be:

  1. Wider adoption
  2. Support from tidyverse developers
  3. A tried-and-true, standardised interface

https://db.rstudio.com/dplyr/
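For concreteness, here is what the standard dbplyr workflow looks like against an existing backend (SQLite via RSQLite, assuming both packages are installed); the idea would be for disk.frame to slot into the same `tbl()`-based pattern:

```r
library(DBI)
library(dplyr)

# Standard dbplyr workflow against an existing backend (SQLite).
# A disk.frame backend would presumably plug into the same pattern.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

result <- tbl(con, "mtcars") %>%       # lazy table reference
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                            # triggers SQL generation + execution

dbDisconnect(con)
```

`collect()` is the point where dbplyr generates and executes the SQL; everything before it is lazy.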

kendonB avatar Sep 12 '19 01:09 kendonB

Thanks for bringing this to my attention. Are you saying that we should set up disk.frame like a remote server database? That could be interesting.

I probably need time to digest this. Some quick thoughts:

  • The attraction of disk.frame for me is the minimal setup required, which is just install.packages("disk.frame"). Setting up a remote database might run counter to that. But it's worth exploring, because disk.frame allows you to run native R functions. A managed disk.frame service in the cloud could be a valid business idea.
  • dbplyr is used to connect to databases and expose dplyr verbs for easy manipulation. In this respect, disk.frame already exposes dplyr verbs, so for the use case where no remote server is involved, it is pretty similar.
  • I want to introduce a disk.frame.db package where group_by is done in one step rather than over two stages. So unless a remote server is involved, there is minimal advantage for disk.frame.

xiaodaigh avatar Sep 12 '19 01:09 xiaodaigh

The database wouldn't necessarily need to be remote to work with dbplyr, I believe. The advantage would be that you wouldn't have to maintain the dplyr interface.

kendonB avatar Sep 12 '19 01:09 kendonB

you wouldn't have to maintain the dplyr interface

I think you do, actually. This is because dbplyr translates the dplyr verbs into SQL (or whichever query language the DB uses). This translation step doesn't happen automatically; it needs to be created by hand and customised for each DB. So there would need to be customisation for disk.frame as well, unless my understanding of dbplyr is all wrong.
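That translation layer can be inspected directly with `translate_sql()` (a sketch assuming the dbplyr package); each backend supplies its own translation rules, which is exactly the hand-written customisation being discussed:

```r
library(dbplyr)

# The same R expression is translated by backend-specific rules;
# the simulate_*() connections let you preview the generated SQL
# without a live database.
translate_sql(mean(x, na.rm = TRUE), con = simulate_sqlite())
translate_sql(mean(x, na.rm = TRUE), con = simulate_postgres())
```

A new backend registers its own set of these translation rules, which is the maintenance work referred to above.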

Unless it's remote, the attraction isn't there, because I want setting up a disk.frame to be minimal hassle. A managed disk.frame.db would be really nice for the occasional large-data workload, or as part of a pipeline.

xiaodaigh avatar Sep 12 '19 01:09 xiaodaigh

I also don't have much understanding of dbplyr, but this page: https://db.rstudio.com/best-practices/select-interface/ seems to suggest that the translation happens via a "consistent set of functions that work across all connections" provided by the DBI package.

kendonB avatar Sep 12 '19 02:09 kendonB

From what I can see, DBI can send SQL statements as strings to databases.
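For example (a sketch using the RSQLite backend), DBI's interface takes SQL as plain strings, with no dplyr involved:

```r
library(DBI)

# DBI sends raw SQL strings to the database and returns a data.frame.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
res <- dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con)
```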


DBI itself doesn't use dplyr, so I deduce that dbplyr must translate the dplyr verbs into SQL strings.

Currently, disk.frame doesn't support SQL (and I don't plan to add it), so there would be some minor maintenance work to "translate" the dplyr verbs. The great thing is that disk.frame just uses data.frames, so no real translation is necessary: all I need to do is faithfully execute the dplyr verbs as defined. The non-dbplyr version of disk.frame can already do this.

So I think a disk.frame for dbplyr makes the most sense as a remote database.

xiaodaigh avatar Sep 12 '19 02:09 xiaodaigh

While I think a disk.frame db would be great, I'm curious how it would differ from setting up a SQLite database (which is fairly straightforward and has very few dependencies). In particular, I think the latter option would probably have much higher peak performance due to indexing.

kmishra9 avatar May 08 '20 01:05 kmishra9

I'm curious how it would be different from setting up a SQLite database

With disk.frame you can apply native R functions efficiently, whereas SQLite is more limited in that regard. But you can test out the speed yourself? Or show me a dataset and a set of operations in SQLite and I can do a speed comparison?
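For illustration, a minimal sketch of the kind of thing disk.frame can do that SQLite cannot express natively: applying an arbitrary native R function chunk-wise (assuming the disk.frame package and its `as.disk.frame`/`cmap` API):

```r
library(disk.frame)
library(dplyr)

setup_disk.frame(workers = 2)  # set up the parallel backend

# Apply an arbitrary R function to each chunk -- no SQL translation
# is possible or needed here.
mtcars.df <- as.disk.frame(mtcars)

result <- mtcars.df %>%
  cmap(~ transform(.x, kpl = mpg * 0.425144)) %>%  # mpg -> km per litre
  collect()
```

`cmap` maps the function over each on-disk chunk; `collect()` materialises the combined result as a data.frame.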

xiaodaigh avatar May 08 '20 01:05 xiaodaigh

I actually agree that disk.frame as a dbplyr backend makes sense and would be about as performant as is possible when using custom R functions (SQLite is definitely more limited and slightly more of a headache to manage)! For peak performance, SQLite might still win, so there's just a tradeoff for folks to consider (similar to data.table vs dplyr!)

kmishra9 avatar Feb 16 '21 20:02 kmishra9