[PROPOSAL] DataFrame-oriented JIT Compiler

Open dlee992 opened this issue 3 years ago • 5 comments

DataFrame-oriented JIT compiler

Background

Mars aims to run Pandas pipelines in a distributed way, which can be more efficient than Pandas alone. Mars partitions large data into chunks that different workers fetch, and each worker then executes its assigned computation using Pandas. As a result, computation speed within a chunk is still largely dominated by Pandas.

Thus, Mars shares some limitations with Pandas:

  1. Pandas executes most APIs in the Python interpreter, which is often slower than compiled languages (e.g., Java, C++);
  2. Because of the Python GIL, many parallelizable Pandas APIs (e.g., df.apply) cannot take advantage of multiple cores;
  3. Pandas executes most computation pipelines eagerly, which gives up optimization opportunities across intermediate steps.

In summary, most Pandas computations are executed in non-compiled, non-parallelized, and non-lazy modes, which impedes performance optimization.

Key goals

For performance-critical computations (e.g., df.groupby().agg(), df.apply(udf)), we aim to provide compiled, parallelized, and efficient pipeline execution for Mars/Pandas. (This requires Mars to provide a convenient interface for fetching the Pandas pipeline on a worker.)

Proposals

We are implementing a DataFrame-oriented JIT compiler for Pandas, which will also benefit Mars and will be tightly integrated with Mars this year.

Core functions

  1. JIT compilation: a basic JIT compilation framework for Python & Pandas code;
  2. Performance boost for a Pandas API: a time-consuming Pandas API (e.g., df.eval(expr), df.apply(udf)) can be compiled and executed;
  3. Efficient execution for Pandas pipeline: a sequence of multiple Pandas APIs (e.g., df.groupby().agg()) will be analyzed and executed as a whole.
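
To make item 3 concrete, here is a minimal pure-Python sketch of executing a pipeline "as a whole". It is only an illustration of the idea, not the proposed compiler: `LazyColumn`, `map`, `filter`, and `collect` are hypothetical names invented for this example and are not Mars, Pandas, Numba, or SDC APIs. The eager version materializes an intermediate list after every step, while the lazy version records the steps and runs them in a single fused pass.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LazyColumn:
    """Hypothetical lazy wrapper: records steps instead of running them."""
    data: List[float]
    steps: List[Tuple[str, Callable]] = field(default_factory=list)

    def map(self, func: Callable) -> "LazyColumn":
        # Record the step; nothing is computed yet.
        return LazyColumn(self.data, self.steps + [("map", func)])

    def filter(self, pred: Callable) -> "LazyColumn":
        return LazyColumn(self.data, self.steps + [("filter", pred)])

    def collect(self) -> List[float]:
        # Fused execution: one pass over the data, no intermediate lists.
        out = []
        for x in self.data:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    x = fn(x)
                elif not fn(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out


def eager(data: List[float]) -> List[float]:
    # Eager execution: every step builds a full intermediate list.
    doubled = [x * 2 for x in data]
    kept = [x for x in doubled if x > 4]
    return [x + 1 for x in kept]


data = [1, 2, 3, 4]
fused = (
    LazyColumn(data)
    .map(lambda x: x * 2)
    .filter(lambda x: x > 4)
    .map(lambda x: x + 1)
    .collect()
)
print(fused == eager(data))  # both strategies produce the same values
```

Because the whole sequence is known before anything runs, a real compiler would have the chance to reorder, fuse, or parallelize steps; this sketch only shows the fusion part.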

The relationship between our compiler and Mars will be similar to that between XLA and TensorFlow. Specifically, Mars can use our compiler in the operator.execute method, or in a Pandas-like pipeline, to speed up execution.

Technical Details Disclosure

In a nutshell, we base our compiler on Numba and SDC (both open-source). Numba (mostly developed by Anaconda) provides JIT compilation for a subset of Python and NumPy code; it includes basic compilation techniques and parallelization (using OpenMP/TBB), and is built on an LLVM backend. SDC (mostly developed by Intel) can be seen as a DataFrame extension of Numba, which lets Numba handle a subset of Pandas code.

However, SDC is barely updated anymore and is not well tested, and type inference in Numba compilation is not powerful enough for DataFrame computation. Besides, neither tool can optimize Pandas-like pipelines. We continue to develop our compiler in these directions on top of the two tools. Internal test results look good. We will try to open-source our compiler project under mars-project as soon as possible, within months.
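
For readers unfamiliar with the Numba building block mentioned above, the following is a small sketch of JIT-compiling a row-wise UDF with Numba's `njit(parallel=True)` and `prange`. It operates on a plain NumPy array, since compiling DataFrame code directly is exactly what SDC and the proposed compiler add on top. The function name `row_norms` and the sample data are made up for illustration, and the `ImportError` fallback is an assumption added so the sketch still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumed fallback: run as plain Python when Numba is unavailable.
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func

    prange = range


@njit(parallel=True)
def row_norms(values):
    """Euclidean norm of each row; the outer loop may run across cores."""
    n_rows, n_cols = values.shape
    out = np.empty(n_rows, dtype=np.float64)
    for i in prange(n_rows):  # rows are independent, so this is parallelizable
        acc = 0.0
        for j in range(n_cols):
            acc += values[i, j] * values[i, j]
        out[i] = np.sqrt(acc)
    return out


data = np.arange(12, dtype=np.float64).reshape(4, 3)
result = row_norms(data)

# Reference result computed with vectorized NumPy for comparison.
expected = np.sqrt((data ** 2).sum(axis=1))
print(np.allclose(result, expected))
```

The first call includes compilation time, so any speedup only shows on repeated calls or on inputs much larger than this toy array.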

Updates

[2022/2/25] Any new messages will be synced here. You are welcome to subscribe to this issue.

dlee992 avatar Feb 25 '22 07:02 dlee992

@dlee992 have you considered Bodo JIT for compiling Pandas dataframes in your system? It is based on Numba, and is available on Pip and Conda (community edition is free up to 8 cores). A lot of typing issues of dealing with Pandas dataframes are already resolved.

ehsantn avatar Apr 08 '22 14:04 ehsantn

@ehsantn, yes, we investigated related Pandas JIT compiler tools, and we also think Bodo is a wonderful tool for our purpose. However, Bodo is not open-source, so we are extending/refactoring SDC/Numba to fulfill our requirements.

BTW, I read your related papers, e.g., HPAT, HiFrame, etc. Very impressive. Just curious, why did you guys choose to abandon SDC?

dlee992 avatar Apr 11 '22 06:04 dlee992

You'd like the ability to extend since the functionality available in Bodo community edition doesn't meet your needs? Or is it the other benefits of open source? I'd like to understand this as we work through what portions of our software need to be open source in general. Are you open to a conversation?

Thanks for the kind words. I built a prototype called HPAT when I was at Intel Labs and CMU. I left in 2019 and started Bodo, but Intel decided to keep around a limited single-node version and renamed it to SDC. We couldn't find a way to collaborate directly with Intel on it.

ehsantn avatar Apr 11 '22 14:04 ehsantn

@ehsantn, currently we want a single-node JIT compiler for Pandas operations, since we use Mars to handle distribution. (Actually, a single-node JIT compiler could be integrated into Pandas directly; I have already seen some Numba usage in Pandas.)

We investigated SDC deeply and extended it with some Pandas APIs (e.g., apply and transform, using the prange feature or the parallel option in Numba). We are also trying to extend the templates for Series/DataFrame operators/operations to speed up compilation and to enable a rewrite pass (similar to RewriteArrayExprs in Numba) after type inference in Numba.

We also have several enhancement ideas that we have not implemented yet. Yes, we are open to a conversation.

Would Bodo like to open-source the single-node part? Or do you have a planned timeline for this?

dlee992 avatar Apr 12 '22 02:04 dlee992

Thanks for the details @dlee992 . Makes sense. We are thinking about open sourcing the single node part but don't have a timeline yet. We want to make sure it would be useful, since it requires a lot of engineering effort.

Could you ping me on Slack here or email (ehsan at bodo.ai) to chat?

ehsantn avatar Apr 13 '22 12:04 ehsantn