Why does torcharrow use velox instead of acero (arrow compute)?

Open UranusSeven opened this issue 2 years ago • 4 comments

Hi guys,

Since torcharrow uses the Arrow in-memory format to achieve zero-copy for external readers, I assume using acero would be more intuitive, and the conversion from the velox format to the arrow format could be avoided. So I am quite interested in why you chose velox instead of acero as the cpu backend?

Thanks!

UranusSeven avatar Aug 01 '22 08:08 UranusSeven

The conversion between Arrow and Velox is also zero-copy. Velox is also collaborating with the Arrow community and Voltron Data, as announced in: https://www.globenewswire.com/news-release/2022/06/23/2468271/0/en/Voltron-Data-Announces-Commitment-to-Improve-Interoperability-Between-Apache-Arrow-and-the-Velox-Open-Source-Project-Created-by-Meta.html

Quoted

... this collaboration will enable Velox to be a ubiquitous data processing workhorse for the coming generation of large-scale Arrow-enabled systems.

cc @pedroerp
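
For readers less familiar with the term, here is a minimal sketch of what "zero-copy" means in general, using only the Python standard library. It is an illustration under stated assumptions, not the actual Arrow or Velox conversion code: `producer_column`, `zero_copy_view`, and `copied_column` are hypothetical names, and `array`/`memoryview` merely stand in for two libraries that agree on one in-memory layout.

```python
from array import array

# Producer-owned column: four 64-bit integers stored contiguously.
# (Hypothetical stand-in for a buffer owned by one library.)
producer_column = array("q", [10, 20, 30, 40])

# Zero-copy hand-off: the consumer wraps the same memory instead of duplicating it.
zero_copy_view = memoryview(producer_column)

# Copying hand-off, for contrast: the consumer gets its own independent buffer.
copied_column = array("q", producer_column)

# Mutate the producer's data after both hand-offs.
producer_column[0] = 99

# The view observes the change because it shares the producer's memory;
# the copy does not because it owns separate memory.
shares_memory = zero_copy_view[0] == 99
copy_is_stale = copied_column[0] == 10
```

The cost of the zero-copy hand-off does not grow with the size of the column, which is why a shared in-memory format matters when data moves between libraries.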

wenleix avatar Aug 01 '22 17:08 wenleix

Thanks, but I'm actually curious about the reasons you chose velox instead of arrow compute. Was it for better performance, for its extensibility, or for other reasons?

UranusSeven avatar Aug 02 '22 02:08 UranusSeven

Outside of performance, Velox provides a much richer feature set. It is also the library into which all other data systems at Meta are being consolidated, so there's a strong argument for consistency with the code executed in engines like Presto, Spark, stream processing systems, and more.

As a more concrete example, all these engines can now provide the same set of functions (UDFs) to users, with consistent signatures and semantics (which is not possible if their implementations are siloed).

pedroerp avatar Aug 04 '22 23:08 pedroerp

Thanks. Velox's feature set and extensibility are what we are looking for, but performance is also quite important to us. So I'm running TPC-H on Velox and arrow compute this week.

May I ask if there are any official performance reports available?

UranusSeven avatar Aug 10 '22 07:08 UranusSeven

Sorry about the delay. We have been running many ad-hoc experiments with TPC-H and TPC-DS and results are looking very promising, but we don't have official numbers to publish yet (of course it also depends on the system we're comparing against).

Happy to give you more context on some of this offline.

pedroerp avatar Sep 16 '22 23:09 pedroerp