arrow
arrow copied to clipboard
ARROW-17351: [C++][Compute] Implement a parser for Expressions
I've seen a few people on the mailing list asking for something like this and I've wanted it myself, so I went ahead and implemented a parser for a lisp-like way of generating Expressions. Calls are of the form (<fn> <args>)
, scalars are of the form $type:value
, and field refs are of the form !<dot path>
.
Thanks for opening a pull request!
If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename pull request title in the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
See also:
https://issues.apache.org/jira/browse/ARROW-17906
:warning: Ticket has no components in JIRA, make sure you assign one.
If this is meant to be a public API, then I'd expect a design discussion on the ML.
Good point, I'll start a discussion.
https://issues.apache.org/jira/browse/ARROW-17351
@pitrou @westonpace I've updated the parser to reflect the discussion on the mailing list. The language now looks like the traditional function call syntax instead of the lisp-style syntax. I've also gotten rid of the !
needed before FieldRefs (now it's either .name
or [idx]
). I've kept the syntax for literals the same
add(.a, $int32:1)
Added better error handling and support for escaping stuff with "".
LGTM! For the next PRs, would be great to invest a bit to improve even more the error handling: to provide more context to the user in the case of invalid expressions.
For instance, for this invalid expression:
add($duration(MILLI):10, $duration(MILLIa):20)
we get this error:
'_error_or_value82.status()' failed with Invalid: Error at index 43: Unterminated data type arg list!
it would be nice to get a more precise error message here.
btw I built this on a macbook pro m1 without issues (clang compiler)
Hey there all, I'm very interested in getting this PR merged ASAP. Is there any remaining work on this to get it merged?
@NoahFournier Sorry. The project is lacking review bandwidth at the moment, so we have to prioritize work and this might unfortunately take some time.
I'd like to revive this as it has been an ask for some time and I think it is important. The technical issues of how the parser is created are probably more minor than the maintenance issue of making sure we come up with an expression syntax we are willing to support and expect to last.
There was a ML discussion on this but I feel it stalled out somewhat. Part of the challenge is that there were two alternatives proposed. Another challenge is that it would be unfortunate to adopt one standard in Arrow only to have Substrait adopt a different standard later. I propose the following:
- Build up a corpus of example expressions (10-20 or so) that demonstrate the various features (different types of scalars, escaping strings, etc.)
- Create a grammar for all proposals (I believe this will help when communicating)
- Send a message to the Substrait mailing list with the proposal
- Revive the Arrow ML discussion and point any interested parties to the Substrait discussion
- Once the Substrait discussion reaches consensus we can merge a parser into arrow-c++
There is a discussion on the Substrait mailing list about defining an expression language as part of a text serialization format for Substrait: https://groups.google.com/g/substrait/c/iCiQR-tHI4Q/m/slzrzdcQAgAJ
Substrait seems like a more appropriate and sustainable place to define an expression language, maintain and version it over time, handle forward and backward compatibility considerations across versions, etc. Of course we will still need Arrow libraries to implement parsers for the expression language. Could the work in this PR be adapted to parse expressions in a language along the lines of what is proposed in that thread on the Substrait mailing list?
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍
There is a discussion on the Substrait mailing list about defining an expression language as part of a text serialization format for Substrait: https://groups.google.com/g/substrait/c/iCiQR-tHI4Q/m/slzrzdcQAgAJ
Substrait seems like a more appropriate and sustainable place to define an expression language, maintain and version it over time, handle forward and backward compatibility considerations across versions, etc. Of course we will still need Arrow libraries to implement parsers for the expression language. Could the work in this PR be adapted to parse expressions in a language along the lines of what is proposed in that thread on the Substrait mailing list?
while the substrait design might be the correct direction to go, but I feel that's a much broader scope compared with this. this PR could bring the preliminaries of filtering into arrow, so some user requests could be fulfilled.
and when the substrait integration is mature and complete, this can be switched at that point.
all in all, folks, @ianmcook @amol- @westonpace any chance this gets revived and get merged?
Hi @zinking , I don't think there is any plan to revive this.
FWIW, the Substrait ExtendedExpression support landed in C++ and Python[1]. The Java implementation[2] is in final review stages as well. I believe this is the current preferred approach.
[1] https://github.com/apache/arrow/pull/34834 [2] https://github.com/apache/arrow/pull/35570