iree-llvm-sandbox

Collaboration on data analytics workloads in MLIR

Open harsh-nod opened this issue 3 years ago • 12 comments

Hi folks @ingomueller-net @webmiche,

@bsarden-rivos and I are interested in running data analytics workloads (as found in popular frameworks such as pandas) on IREE. To do that, we have been trying to flesh out a path from pandas to MLIR. I have a simple prototype that takes element-wise addition and lowers it to linalg here: https://github.com/nod-ai/pandas-mlir. Recently, based on @bsarden-rivos's findings, we have been thinking of using Substrait (https://substrait.io), and more specifically the ibis-substrait compiler (https://github.com/ibis-project/ibis-substrait), as a starting point for lowering to MLIR (linalg on tensors).

Looking at your commits in this repo, it seems like you are exploring an alternative path to MLIR, so we would love to chat and brainstorm with you all about your project goals, roadmap, and current state of things, and about how we can align efforts and collaborate.

Thanks and looking forward to collaborating with you all, Harsh

harsh-nod avatar Jun 02 '22 22:06 harsh-nod

Hello @harsh-nod , thanks for sharing your plans!

Would you be available to meet next week with @webmiche ? Since Harsh is on the W coast, I could meet Thursday 5-7pm CEST (8-10am PST) or after 8 pm CEST / 11am PST.

Are you available in any of these slots?

nicolasvasilache avatar Jun 03 '22 16:06 nicolasvasilache

Hi @nicolasvasilache ,

Unfortunately, I am out of the office on Thursday the 9th, but 8am PST works for me on Mon, Tue, or Wed of next week. Alternatively, if only Thursdays work, I can do Thursday June 16 at 8am PST. Do any of those days work for you?

harsh-nod avatar Jun 03 '22 16:06 harsh-nod

Hey, that sounds interesting, looking forward to the meeting!

In terms of timing, I can't make Thursday the 9th since I will be on military service Thursday and Friday. Other than that, I can make 8am PST on any day but Fridays.

webmiche avatar Jun 04 '22 10:06 webmiche

Ok, let's do June 16th at 8am PST? I can also do other days, but it seems that this particular day has already been identified as working for everyone.

nicolasvasilache avatar Jun 06 '22 17:06 nicolasvasilache

Hi all, I'm excited for our chat! I can also do other days, but earlier than June 16th works better for me since I have some free cycles to work on this early this week and next week. My schedule is wide open, but how does Monday (6/13) at 8am work for folks? A few questions for @webmiche in the meantime:

  1. Where would be the best place to start contributing? Looking at some of the PRs in flight, I'm also interested in running a TPC-H query through MLIR, but I'm not sure where to start.
  2. What would be the best path for running a query e2e? Does extending alp and the AlpRuntime to execute a query make sense, or is there already something in the works that I can help flesh out?

bsarden-rivos avatar Jun 06 '22 18:06 bsarden-rivos

Unfortunately I am out of the office on Monday 6/13, 14 and 15. If we want something sooner, I can meet 8am PST tomorrow 6/7 or the day after 6/8?

harsh-nod avatar Jun 06 '22 18:06 harsh-nod

I can meet 8am PST tomorrow 6/7 or the day after 6/8?

Either time works for me!

bsarden-rivos avatar Jun 06 '22 19:06 bsarden-rivos

Unfortunately I am out of the office on Monday 6/13, 14 and 15. If we want something sooner, I can meet 8am PST tomorrow 6/7 or the day after 6/8?

For me, that time window would only work tomorrow (6/8).

  1. Where would be the best place to start contributing? Looking at some of the PRs in flight, I'm also interested in running a TPC-H query through MLIR, but I'm not sure where to start.

I think it would be very useful for our meeting if you could look through the TPC-H queries and think a bit about the challenges of modeling and running them with MLIR. I think I found "hard to solve" problems for most of them, and I feel that Q6 is the most reasonable one to get running first, but I would be happy to get a second opinion.
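For readers who have not looked at Q6: it is a single-table filter plus a sum over `lineitem`, which is why it is a natural first target. The sketch below is plain Python over column lists, not code from the sandbox; the function name `tpch_q6` and the sample rows are made up for illustration, and the predicate constants are the default substitution parameters from the TPC-H specification.

```python
from datetime import date

def tpch_q6(shipdate, discount, quantity, extendedprice):
    """Columnar sketch of TPC-H Q6: sum(extendedprice * discount) over
    rows that pass three range predicates on lineitem columns."""
    lo, hi = date(1994, 1, 1), date(1995, 1, 1)
    # Step 1: build a selection mask, one boolean per row.
    mask = [
        lo <= d < hi and 0.05 <= disc <= 0.07 and qty < 24
        for d, disc, qty in zip(shipdate, discount, quantity)
    ]
    # Step 2: reduce the selected rows to a single revenue value.
    return sum(p * disc
               for keep, p, disc in zip(mask, extendedprice, discount)
               if keep)

# Hypothetical sample data: only the first and last rows qualify.
shipdate = [date(1994, 3, 1), date(1995, 1, 1), date(1994, 6, 1),
            date(1994, 6, 1), date(1994, 12, 31)]
discount = [0.06, 0.06, 0.10, 0.05, 0.07]
quantity = [10, 10, 10, 30, 23]
extendedprice = [1000.0, 1000.0, 500.0, 500.0, 200.0]

revenue = tpch_q6(shipdate, discount, quantity, extendedprice)
```

The mask-then-reduce split mirrors how an element-wise op followed by a reduction might be expressed on tensors; there are no joins or group-bys to model.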

  2. What would be the best path for running a query e2e? Does extending alp and the AlpRuntime to execute a query make sense, or is there already something in the works that I can help flesh out?

This is still very much an open question. The broad idea is that, since pandas stores data in columnar form and these columns are NumPy arrays, we extract the NumPy arrays from pandas and pass them to our MLIR functions (find the file here). This approach piggybacks on parts of the sandbox. AFAIK, we have not yet developed a more concrete or complete idea of how this should look in the end.
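A minimal sketch of the extraction step described above, assuming numeric columns only. `columns_as_arrays` and `add_columns` are hypothetical names; `add_columns` is a plain NumPy stand-in for whatever compiled MLIR function would actually receive the buffers, not the sandbox's API.

```python
import numpy as np
import pandas as pd

def columns_as_arrays(df):
    """Return {column name: contiguous 1-D NumPy array} for each column.

    Series.to_numpy() exposes the column's values; ascontiguousarray
    guarantees a dense buffer that could be handed to compiled code.
    """
    return {name: np.ascontiguousarray(df[name].to_numpy())
            for name in df.columns}

def add_columns(a, b):
    # Hypothetical stand-in for a compiled element-wise MLIR kernel.
    return a + b

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
cols = columns_as_arrays(df)
result = add_columns(cols["a"], cols["b"])
```

Columns with missing values or object dtype (strings, for example) would need extra handling before they can be treated as flat numeric buffers.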

webmiche avatar Jun 07 '22 06:06 webmiche

6/8 at 8am PST is great, I'll post a link here

nicolasvasilache avatar Jun 07 '22 08:06 nicolasvasilache

Not sure if you all have read this (it just came out a few days ago), but I found an interesting paper on implementing relational operators in PyTorch and running TPC-H queries (including Q6), where they outperform DuckDB: "Query Processing on Tensor Computation Runtimes".

harsh-nod avatar Jun 07 '22 16:06 harsh-nod

Here is the link for today's meeting. Video call link: https://meet.google.com/ndw-fzsv-hqb Or dial: (CH) +41 31 560 24 00 PIN: 295 558 240 8107# More phone numbers: https://tel.meet/ndw-fzsv-hqb?pin=2955582408107

nicolasvasilache avatar Jun 08 '22 08:06 nicolasvasilache

I'm trying to join the meeting, but it's stuck at "Asking to join...".

harsh-nod avatar Jun 08 '22 15:06 harsh-nod