datafusion
Finalize SIGMOD 2024 paper ~(if accepted)~
Is your feature request related to a problem or challenge?
@JayjeetAtGithub, @Dandandan, @yjshen, @ozankabak, @sunchao, @viirya, and I submitted a paper to the SIGMOD 2024 conference, which was tracked by https://github.com/apache/arrow-datafusion/issues/6782
If our paper is accepted, this ticket tracks follow-on work items to complete prior to the final copy.
For the Industrial Track the dates are:
- All deadlines below are 11:59 PM Pacific Time.
- Paper submission: Thursday, November 30, 2023
- Notification of accept/reject: Wednesday, January 31, 2024
- Camera-ready deadline: Thursday, March 28, 2024
Describe the solution you'd like
Here are the items I know so far:
- [x] Fix the (currently) non working email for @JayjeetAtGithub ([email protected] currently does not work)
- [ ] Clean up bibliography into a consistent style (sometimes all authors are listed, sometimes just the first one is -- they should all be the same)
Nice to haves:
- [ ] Update benchmark scripts: https://github.com/JayjeetAtGithub/datafusion-duckdb-benchmark/pull/25
- [ ] Rerun the benchmarks with versions of datafusion and duckdb that have been released since our initial runs (e.g. datafusion `33` and duckdb `0.9.2`; see https://github.com/apache/arrow-datafusion/issues/6782#issuecomment-1827504952)
- [ ] Update the `Results` section with the new results, updating the query textual descriptions if needed
Describe alternatives you've considered
No response
Additional context
No response
Here is what we submitted: DataFusion_Query_Engine___SIGMOD_2024.pdf
I can check results and update them, please assign it to me.
FWIW the notification deadline was yesterday but I have not heard anything one way or the other (and the CMT tool doesn't say either). I will email the chairs tomorrow if we haven't heard by then
I emailed the chairs today and they said the notification will be delayed a few days. Will post updates here as I have them.
Thanks for the update 🚀
Thank you @alamb
The paper was accepted to SIGMOD! 🎉
I'll spend some time reviewing the comments later this week and we can organize action items for the final draft
From: Microsoft CMT <[email protected]>
Date: Sun, Feb 4, 2024 at 11:28 AM
Subject: SIGMOD 2024 Industry Track decision for Paper 1
To: Andrew Lamb
Dear Andrew Lamb,
It is our great pleasure to inform you that your paper #1 "Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine" has been Accepted to the conference. Congratulations!
The papers will be presented at SIGMOD 2024, in Santiago, Chile in June, so please plan for at least one author of the paper to attend the conference. This might require a visa, so please consult the following page at your earliest convenience https://2024.sigmod.org/visa_chile.shtml .
We hope that you will find the reviews helpful in revising accordingly the camera-ready version of the manuscript. Please note that the papers will appear in PACMMOD, like the Research Track papers of SIGMOD 2024. The formatting guidelines for the camera-ready papers are available at: https://dl.acm.org/journal/pacmmod/author-guidelines#formatting , under the "Length and Format for Camera-Ready Papers" section.
Congratulations again and looking forward to seeing you at SIGMOD 2024!
Danica and Ippokratis.
Cool! Congrats to all!
This is great news! congrats all!
Congratulations everyone !
Great news! Congratulations to all involved!
Congratulations everyone !
Here is the reviewer feedback
Reviewer #2 Questions
- Is the paper readable and well organized? Definitely - very clear
- Does this paper present a significant addition to the body of work in the area of data management research? Definitely - a significant addition
- Is the paper likely to have a broad impact on the data management community? SIGMOD attendees will learn something interesting from the paper; the paper is likely to influence research in the community
- Overall rating Accept
- Reviewer’s confidence Expert
- Strong points
- Good presentation of the Apache Arrow DataFusion open-source project.
- DataFusion efficiently implements operators that can be used by various other data systems, avoiding their cumbersome re-implementation.
- Good experimental results versus DuckDB (which is an extremely well optimized embeddable analytics database).
- I really appreciate how the DataFusion community was involved even in writing this paper. See here: https://github.com/apache/arrow-datafusion/issues/6782
- Weak points
- Minor: although well-engineered, the algorithms behind the supported operators are not new. DataFusion implements well-known techniques.
- Overall comments The paper describes the functionality of DataFusion, a very well-designed and implemented library based on Apache Arrow, which implements a variety of operators used in SQL. Similar to Arrow, DataFusion is an embeddable library (built in Rust), which can easily be embedded in broader data systems that require analytical operations. The paper includes a nice experimental evaluation versus DuckDB, demonstrating good results.
Reviewer #5 Questions
- Is the paper readable and well organized? Definitely - very clear
- Does this paper present a significant addition to the body of work in the area of data management research? Mostly - the contributions are above the bar
- Is the paper likely to have a broad impact on the data management community? SIGMOD attendees will learn something interesting from the paper
- Overall rating Accept
- Reviewer’s confidence Expert
- Strong points
- The paper is well written.
- Extensive evaluation using 3 popular benchmarks.
- An active community-driven project.
- Weak points
- The DataFusion project is a combination and integration of other well-known components/systems; as such, its overall technical novelty is limited.
- The experimental evaluation didn't compare against many other popular OLAP systems in the field.
- The support for complex analytical queries (e.g., multi-way join as those found in TPC-DS) is limited.
- Overall comments This paper is well written and the DataFusion project has good momentum in the community. The idea of building an OLAP engine using a decoupled, component-based approach is interesting (versus tightly coupled designs). The paper has described most elements in DataFusion, but didn't offer enough details to demonstrate sufficient technical novelty (that goes beyond integration of various existing components). How to better suit the cloud environment where most OLAP engines are running on nowadays is also not discussed in the paper.
Reviewer #7 Questions
- Is the paper readable and well organized? Mostly - the presentation has minor issues, but is acceptable
- Does this paper present a significant addition to the body of work in the area of data management research? Mostly - the contributions are above the bar
- Is the paper likely to have a broad impact on the data management community? SIGMOD attendees will learn something interesting from the paper
- Overall rating Reject
- Reviewer’s confidence Knowledgeable
- Strong points
- Presents the technologies that power DataFusion and provides motivating use cases for using DataFusion, making a compelling argument over reuse in analytic systems using commodity OLAP engines and a paradigm shift in that direction.
- Provides extensive evaluation of DataFusion's performance.
- Presents DataFusion's architecture, extension APIs and features.
- Weak points
- One of the main claims of the paper is that DataFusion is catalyzing the development of new data systems. The presentation and the evaluation of the paper would benefit from elaborating further on this claim.
- Section 5.1 Engine overview and Figure 2 need to be more extensive to be able to follow the rest of Section 5.
- The LLVM analogy distracts from the paper.
- The paper claims in section 7.4 that "..DataFusion can be customized for these different environments using the MemoryPool trait to control memory allocations, the DiskManager trait for managing temporary files (if any), and a CacheManager for caching information such as directory contents and per-file metadata.". More technical details on this topic would be helpful.
- The "Single Core Efficiency" section could benefit from running TPC-H across multiple threads and configuration settings. The authors mention a caveat of restricting DuckDB performance for some benchmarks by using a single thread.
- Overall comments I would like to thank the authors for their work. Please find some additional minor comments below:
- Please move the figure out of the first page, or to the bottom of the first page. It is distracting to read the caption of Figure 1 before the abstract.
- Please update the axes in Figure 7 to be legible.
- One of the main topics of the paper is that DataFusion catalyzes the development of new data systems. Evaluation in that direction would help support the claims in the paper further. One related angle could be the ease of developing systems (applications) on top of DataFusion, potentially including the overhead in terms of lines of code or engineering hours in developing a simple system/application with DataFusion and using a different stack or being customly built. Similarly, performance evaluation of systems relying on DataFusion could help in this direction as well.
- Similarly, the content of the paper would benefit from doing a deep dive into the query engine and a limited set of features based on how they are used by systems developed on DataFusion.
It appears we have about 2 months to complete the final draft
Camera-ready deadline: Thursday, March 28, 2024
Here is a summary of my suggested action items based on the reviewer feedback above
- [x] Add more examples / better explanation of systems built on DataFusion (we have some good new examples I know of since submission -- Arroyo, Comet, and LanceDB come to mind)
- [x] Please move the figure out of the first page, or to the bottom of the first page. It is distracting to read the caption of Figure 1 before the abstract.
- [x] "Section 5.1 Engine overview and Figure 2 need to be more extensive to be able to follow the rest of Section 5."
- [x] "The LLVM analogy distracts from the paper." - I happen to like this analogy (and I think @ozankabak does too), but maybe we can make this section shorter / more concise.
- [x] Extend section 7.4's description with technical details about `MemoryPool`, `DiskManager`, and `CacheManager` "for caching information such as directory contents and per-file metadata."
- [x] "The authors mention a caveat of restricting DuckDB performance for some benchmarks by using a single thread." -- We should make what we measured clearer
- [x] work in "One related angle could be the ease of developing systems (applications) on top of DataFusion, potentially including the overhead in terms of lines of code or engineering hours in developing a simple system/application with DataFusion and using a different stack or being customly built. Similarly, performance evaluation of systems relying on DataFusion could help in this direction as well."
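As background for the `MemoryPool` item above, here is a minimal, self-contained Rust sketch of what a trait-based memory-limit extension point can look like. The trait and `GreedyPool` names echo the paper's terminology, but the methods and implementation here are simplified placeholders for illustration, not DataFusion's actual API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical, simplified extension point: operators ask the pool
// for memory before allocating, and return it when done.
trait MemoryPool: Send + Sync {
    /// Try to reserve `bytes`; returns false if the pool is exhausted
    /// (the caller would then spill to disk or report an error).
    fn try_grow(&self, bytes: usize) -> bool;
    /// Return previously reserved bytes to the pool.
    fn shrink(&self, bytes: usize);
    /// Bytes currently reserved.
    fn reserved(&self) -> usize;
}

/// A greedy pool: memory is handed out first-come, first-served
/// up to a fixed limit.
struct GreedyPool {
    limit: usize,
    used: AtomicUsize,
}

impl GreedyPool {
    fn new(limit: usize) -> Self {
        Self { limit, used: AtomicUsize::new(0) }
    }
}

impl MemoryPool for GreedyPool {
    fn try_grow(&self, bytes: usize) -> bool {
        // Compare-and-swap loop so concurrent operators cannot
        // collectively exceed the limit.
        let mut current = self.used.load(Ordering::Relaxed);
        loop {
            let new = current + bytes;
            if new > self.limit {
                return false;
            }
            match self.used.compare_exchange_weak(
                current, new, Ordering::SeqCst, Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual,
            }
        }
    }

    fn shrink(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::SeqCst);
    }

    fn reserved(&self) -> usize {
        self.used.load(Ordering::SeqCst)
    }
}

fn main() {
    let pool = GreedyPool::new(1024);
    assert!(pool.try_grow(1000)); // fits within the 1024-byte limit
    assert!(!pool.try_grow(100)); // would exceed the limit
    pool.shrink(500);
    assert!(pool.try_grow(100)); // fits again after shrinking
    assert_eq!(pool.reserved(), 600);
}
```

The point of the trait boundary is that an embedding system can swap in its own policy (e.g. fair sharing across operators, or deferring to a host-process budget) without touching the engine itself.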
Here are some other notes I have
The main criticism / weakness cited is that DataFusion doesn't demonstrate sufficient technical novelty beyond integration of various existing ideas. I think this is a very valid point, and maybe we should re-emphasize that the novelty lies not in any individual part, but in the overall system.
How to better suit the cloud environment where most OLAP engines are running on nowadays is also not discussed in the paper.
This is a good point that would be good to work in
Similarly, the content of the paper would benefit from doing a deep dive into the query engine and a limited set of features based on how they are used by systems developed on DataFusion.
I agree this would be an interesting point, but given that we are already at the 12 page limit I am not sure how to do so in this particular paper. Maybe these would make good follow on papers or blog posts (@appletreeisyellow and I could potentially write one on how InfluxDB uses PruningPredicates 🤔 )
To update the draft I'm assuming we can just reuse the same overleaf project? we'd be happy to touch a bit more on the Comet side, and update the sentence 😂
DataFusion is used by several Spark native runtimes, including Blaze [10] and at least one project that is not yet open-source.
Yea, as Comet now is open sourced, we can explicitly mention the project (with project link) and more details about it.
To update the draft I'm assuming we can just reuse the same overleaf project? we'd be happy to touch a bit more on the Comet side, and update the sentence 😂
Yes, please, let's use the same overleaf project
Yea, as Comet now is open sourced, we can explicitly mention the project (with project link) and more details about it.
Yes please that would be great -- and it will also address some of the reviewer feedback suggesting more details on usecases
"The LLVM analogy distracts from the paper." - I happen to like this analogy (and I think @ozankabak does too), but maybe we can make this section shorter / more concise.
I think this analogy is very useful -- let's keep it. In my experience it also resonates with technical folks very well. Since this feedback seems like an outlier in terms of reception, I suggest we improve other aspects of the paper.
@JayjeetAtGithub is there any chance you can update your email address to one that works (rather than the influxdata one that does not)?
Also, it would be great if someone could work on cleaning up the bibliography.
Also, we maybe can add some other users like Seafowl (now part of EnterpriseDB), which I think could potentially be described as a postgres analytics accelerator (aka it is to postgres what Comet is to Spark). Maybe @gruuya can correct me if I got that wrong
which I think could potentially be described as a postgres analytics accelerator (aka it is to postgres what Comet is to Spark)
Yeah, basically that's what we strive for, thanks!
@alamb I updated my affiliation and email to that of UC Santa Cruz, my university.
An update here: I plan to take a pass through the draft the week of March 4 and implement the bulk of any feedback that was not yet implemented. After that week I'll likely take a few proofreading passes, but I don't expect to do any major revisions
I also don't plan to rerun benchmarks again due to lack of time. While the benchmark runs themselves are nicely automated thanks to @JayjeetAtGithub, analyzing the results takes significant time and research.
I started getting back into paper writing mode today. I plan to take a linear pass through the sections over the next few days addressing reviewer feedback.
This morning, I started working on the first page
Add more examples / better explanation of systems built on DataFusion (we have some good new examples I know of since submission -- Arroyo, Comet, and LanceDB come to mind)
(that looks pretty much done now to me)
The main criticism / weakness cited is that DataFusion doesn't demonstrate sufficient technical novelty other than integration of various existing ideas. I think this is a very valid point, and maybe we should re-emphasize the point more that it isn't technical novelty of any part, but the overall system.
I reworded the abstract to try and make the "not novel" point more explicitly. Here is what I came up with:
"Apache Arrow DataFusion\cite{DataFusion} is a fast, embeddable, and extensible query engine written in Rust\cite{Rust} that uses Apache Arrow\cite{Arrow} as its memory model. While the individual techniques used by DataFusion have been previously described, it differs from other industrial strength engines by providing competitive performance \textit{and} an open architecture that can be customized using over 10 major extension APIs. This flexibility has led to its use in many commercial and open source databases, machine learning pipelines, and other data-intensive systems. We anticipate that the accessibility and versatility of DataFusion, along with its competitive performance, will further enable the proliferation of high-performance custom data infrastructures tailored to specific needs."
Please move the figure out of the first page, or to the bottom of the first page. It is distracting to read the caption of Figure 1 before the abstract.
I personally like the visual impact of the figure at the beginning so I would prefer keeping its location where it is. However, as the reviewer points out, the extended caption on the figure was duplicative / repetitive with the abstract. I thus reduced the caption to the following, which I think captures the essence with less distraction
"When building with DataFusion, system designers implement domain-specific features via extension APIs (blue), rather than re-implementing standard OLAP query engine technology (green)."
I also updated the figure with the new DataFusion logo https://github.com/apache/arrow-datafusion/issues/8788 (thanks @pinarbayata)
I think the first page is now looking quite good
I plan to work on the other feedback items noted in https://github.com/apache/arrow-datafusion/issues/8373#issuecomment-1931982085 over the next few days and will keep that checklist updated with my progress
"The LLVM analogy distracts from the paper." - I happen to like this analogy (and I think @ozankabak does too), but maybe we can make this section shorter / more concise.
I reduced the space taken by the LLVM section by distilling it, while still leaving the section and its content in place. I think we could probably reduce it a bit more if anyone wants to wordsmith further
"Section 5.1 Engine overview and Figure 2 need to be more extensive to be able to follow the rest of Section 5."
I took a pass through the start of section 5 and figure 2 as well. I moved the content of the Figure 2 caption into the first paragraph of Section 5 as a bullet point, and tried to better align it to the section headers.
How to better suit the cloud environment where most OLAP engines are running on nowadays is also not discussed in the paper.
I took a shot at rewording the introduction to mention cloud databases (mainly as an example of the forces that drive the need for new databases with OLAP engines)
"The authors mention a caveat of restricting DuckDB performance for some benchmarks by using a single thread." -- We should make what we measured clearer
I moved the discussion of restricting per-core performance out of the experimental evaluation introduction and into the "Single Core Efficiency" subsection, which I think makes it clearer that the restriction was only used for the single-core efficiency experiments