dplyrimpaladb

Best strategy for data loading from dplyr with respect to ImpalaDB

Open piersharding opened this issue 9 years ago • 7 comments

Continuing the thread from https://github.com/hadley/dplyr/issues/383#issuecomment-120646259 - dplyrimpaladb currently relies on rhdfs for table loading, which is also used to handle the temp tables generated by dplyr.

piersharding avatar Jul 11 '15 18:07 piersharding

Has this project been abandoned?

peterparkerspicklepatch avatar Apr 18 '16 17:04 peterparkerspicklepatch

No.

piersharding avatar Apr 18 '16 18:04 piersharding

Are there any updates? Thank you for working on this project - it's been a godsend. That said, it's very slow, and I've had to switch to RImpala.

peterparkerspicklepatch avatar Apr 19 '16 19:04 peterparkerspicklepatch

Can you give a concrete, easily reproducible example of the problem (maybe with the Lahman dataset?) so that I can go through it?

Cheers, Piers Harding

piersharding avatar Apr 19 '16 19:04 piersharding

Hmmm, my issue is speed; I don't know if that's reproducible. Have you tried benchmarking dplyrimpaladb against RImpala?

peterparkerspicklepatch avatar May 10 '16 21:05 peterparkerspicklepatch

OK - can you give me an example of what you are doing in RImpala and the equivalent that you are trying in dplyr, so that I can try to figure out what is going on? dplyrimpaladb should theoretically be a thin layer over the top of the underlying Java libs - beyond that, the cause could be the generated SQL.

piersharding avatar May 10 '16 21:05 piersharding

FTR, I had the opposite experience regarding speed/performance with RJDBC or dplyrimpaladb vs. RImpala, documented at http://datascience.la/r-and-impala-its-better-to-kiss-than-using-java

daroczig avatar May 10 '16 21:05 daroczig