arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17351: [C++][Compute] Implement a parser for Expressions

Open save-buffer opened this issue 2 years ago • 7 comments

I've seen a few people on the mailing list asking for something like this and I've wanted it myself, so I went ahead and implemented a parser for a lisp-like way of generating Expressions. Calls are of the form (<fn> <args>), scalars are of the form $type:value, and field refs are of the form !<dot path>.

save-buffer avatar Oct 01 '22 04:10 save-buffer

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Oct 01 '22 04:10 github-actions[bot]

https://issues.apache.org/jira/browse/ARROW-17906

github-actions[bot] avatar Oct 01 '22 04:10 github-actions[bot]

:warning: Ticket has no components in JIRA, make sure you assign one.

github-actions[bot] avatar Oct 01 '22 04:10 github-actions[bot]

If this is meant to be a public API, then I'd expect a design discussion on the ML.

pitrou avatar Oct 05 '22 13:10 pitrou

Good point, I'll start a discussion.

save-buffer avatar Oct 05 '22 18:10 save-buffer

https://issues.apache.org/jira/browse/ARROW-17351

github-actions[bot] avatar Oct 10 '22 20:10 github-actions[bot]

@pitrou @westonpace I've updated the parser to reflect the discussion on the mailing list. The language now looks like the traditional function call syntax instead of the lisp-style syntax. I've also gotten rid of the ! needed before FieldRefs (now it's either .name or [idx]). I've kept the syntax for literals the same

add(.a, $int32:1)

save-buffer avatar Oct 17 '22 18:10 save-buffer

Added better error handling and support for escaping stuff with "".

save-buffer avatar Nov 23 '22 02:11 save-buffer

LGTM! For the next PRs, would be great to invest a bit to improve even more the error handling: to provide more context to the user in the case of invalid expressions. For instance, for this invalid expression: add($duration(MILLI):10, $duration(MILLIa):20) we get this error: '_error_or_value82.status()' failed with Invalid: Error at index 43: Unterminated data type arg list! it would be nice to get a more precise error message here. btw I built this on a macbook pro m1 without issues (clang compiler)

aucahuasi avatar Nov 24 '22 23:11 aucahuasi

Hey there all, I'm very interested in getting this PR merged ASAP. Is there any remaining work on this to get it merged?

noahfrn avatar Dec 15 '22 13:12 noahfrn

@NoahFournier Sorry. The project is lacking review bandwidth at the moment, so we have to prioritize work and this might unfortunately take some time.

pitrou avatar Dec 15 '22 16:12 pitrou

I'd like to revive this as it has been an ask for some time and I think it is important. The technical issues of how the parser is created are probably more minor than the maintenance issue of making sure we come up with an expression syntax we are willing to support and expect to last.

There was a ML discussion on this but I feel it stalled out somewhat. Part of the challenge is that there were two alternatives proposed. Another challenge is that it would be unfortunate to adopt one standard in Arrow only to have Substrait adopt a different standard later. I propose the following:

  • Build up a corpus of example expressions (10-20 or so) that demonstrate the various features (different types of scalars, escaping strings, etc.)
  • Create a grammar for all proposals (I believe this will help when communicating)
  • Send a message to the Substrait mailing list with the proposal
  • Revive the Arrow ML discussion and point any interested parties to the Substrait discussion
  • Once the Substrait discussion reaches consensus we can merge a parser into arrow-c++

westonpace avatar Jan 06 '23 01:01 westonpace

There is a discussion on the Substrait mailing list about defining an expression language as part of a text serialization format for Substrait: https://groups.google.com/g/substrait/c/iCiQR-tHI4Q/m/slzrzdcQAgAJ

Substrait seems like a more appropriate and sustainable place to define an expression language, maintain and version it over time, handle forward and backward compatibility considerations across versions, etc. Of course we will still need Arrow libraries to implement parsers for the expression language. Could the work in this PR be adapted to parse expressions in a language along the lines of what is proposed in that thread on the Substrait mailing list?

ianmcook avatar Jan 25 '23 23:01 ianmcook

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

amol- avatar Mar 30 '23 17:03 amol-

There is a discussion on the Substrait mailing list about defining an expression language as part of a text serialization format for Substrait: https://groups.google.com/g/substrait/c/iCiQR-tHI4Q/m/slzrzdcQAgAJ

Substrait seems like a more appropriate and sustainable place to define an expression language, maintain and version it over time, handle forward and backward compatibility considerations across versions, etc. Of course we will still need Arrow libraries to implement parsers for the expression language. Could the work in this PR be adapted to parse expressions in a language along the lines of what is proposed in that thread on the Substrait mailing list?

while the substrait design might be the correct direction to go, but I feel that's a much broader scope compared with this. this PR could bring the preliminaries of filtering into arrow, so some user requests could be fulfilled.

and when the substrait integration is mature and complete, this can be switched at that point.

all in all, folks, @ianmcook @amol- @westonpace any chance this gets revived and get merged?

zinking avatar Jul 31 '23 09:07 zinking

Hi @zinking , I don't think there is any plan to revive this.

FWIW, the Substrait ExtendedExpression support landed in C++ and Python[1]. The Java implementation[2] is in final review stages as well. I believe this is the current preferred approach.

[1] https://github.com/apache/arrow/pull/34834 [2] https://github.com/apache/arrow/pull/35570

danepitkin avatar Sep 11 '23 23:09 danepitkin