dplyrimpaladb
Best strategy for data loading from dplyr with respect to ImpalaDB
Continuing the thread from https://github.com/hadley/dplyr/issues/383#issuecomment-120646259: dplyrimpaladb is currently reliant on rhdfs for table loading, which is also used for handling the temp tables generated by dplyr.
Has this project been abandoned?
No.
Are there any updates? Thank you for working on this project - it's been a godsend. That said, it's very slow, and I've had to switch to RImpala.
Can you give a concrete easily reproducible example of what your problem is (maybe with the Lahman dataset?) so that I can go through it?
Cheers, Piers Harding
Hmmm, my issue is speed; I don't know if that's reproducible. Have you tried benchmarking dplyrimpaladb against RImpala?
OK - can you give me an example of what you are doing in RImpala and the equivalent that you are trying in dplyr, so that I can try to figure out what is going on? dplyrimpaladb should theoretically be a thin layer over the top of the underlying Java libs - outside of that, it could be the generated SQL....
FTR, I had the opposite experience regarding speed/performance with RJDBC or dplyrimpaladb vs. RImpala -- documented at http://datascience.la/r-and-impala-its-better-to-kiss-than-using-java