Push aggregations down to Spark

Open · jeremyrsmith opened this issue 7 years ago • 1 comment

When using withDataFrame, Vegas collects all the data to the driver, or samples it instead once the data exceeds a threshold.

But when doing aggregations in your plot, this means it will fetch all the data to the driver – potentially sampling it – and push all of it to vega-lite, where the aggregation will happen in JavaScript in the browser. This is probably never what you want.

It would be totally possible to map AggOps to Spark aggregations, and push the aggregation itself down to Spark. This will reduce the cardinality of the data dramatically, and would probably eliminate the need to sample in most cases.
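
A minimal sketch of what such a mapping could look like, assuming a small set of aggregate operations. `Agg` is a hypothetical stand-in for Vegas's AggOps, and `PushDown`, `toSparkAgg`, and `aggregateForPlot` are illustrative names, not part of Vegas. Only the Spark calls (`groupBy`, `agg`, and the functions in `org.apache.spark.sql.functions`) are real APIs.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{avg, col, count, max, min, sum}

// Hypothetical stand-in for Vegas's AggOps; not the library's actual type.
sealed trait Agg
object Agg {
  case object Mean  extends Agg
  case object Sum   extends Agg
  case object Min   extends Agg
  case object Max   extends Agg
  case object Count extends Agg
}

object PushDown {
  // Translate an aggregate op on a field into a Spark Column expression.
  def toSparkAgg(op: Agg, field: String): Column = op match {
    case Agg.Mean  => avg(col(field))
    case Agg.Sum   => sum(col(field))
    case Agg.Min   => min(col(field))
    case Agg.Max   => max(col(field))
    case Agg.Count => count(col(field))
  }

  // Group by the plot's dimension and aggregate on the executors,
  // so that only one row per group has to be collected to the driver.
  def aggregateForPlot(
      df: DataFrame,
      groupField: String,
      valueField: String,
      op: Agg
  ): DataFrame =
    df.groupBy(col(groupField))
      .agg(toSparkAgg(op, valueField).as(valueField))
}
```

Used against a local session, this might look like the following. The pre-aggregated DataFrame could then be handed to withDataFrame with the aggregate removed from the encoding, since the values are already reduced.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("pushdown-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0), ("a", 3.0), ("b", 5.0)).toDF("category", "value")
// One row per category instead of one row per input record.
val small = PushDown.aggregateForPlot(df, "category", "value", Agg.Mean)
```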

jeremyrsmith · Oct 18 '17 21:10

Thanks, @jeremyrsmith

> This is probably never what you want.

I agree with that. I think the default behaviour should be changed: it would be better to pass all the data to vega-lite by default. See https://github.com/vegas-viz/Vegas/blob/1496432875f80e9e579cc584fa8fd299f34a71a6/spark/src/main/scala/vegas/sparkExt/package.scala#L8-L17

oshikiri · Jul 28 '18 16:07