
Add trendline PPL command

Open · jduo opened this pull request 1 year ago • 14 comments

Description

Adds the `trendline` PPL command.

Related Issues

Resolves #3013, #3011

Check List

  • [x] New functionality includes testing.
  • [x] New functionality has been documented.
  • [x] New functionality has javadoc added.
  • [x] New functionality has a user manual doc added.
  • [x] API changes companion pull request created.
  • [x] Commits are signed per the DCO using --signoff.
  • [x] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

jduo avatar Oct 12 '24 00:10 jduo

I have this almost hooked up. I loaded the `students` table, which has `name`, `gpa`, and `grad_year` fields. When I issue this PPL query, it seems to use the schema from the implied `ProjectOperator` instead of the schema from the `TRENDLINE` command, even though I overrode `TrendlineOperator#schema()` to build a schema based only on the computations list:

```json
{ "query" : "source=students | TRENDLINE SMA(1, gpa) as foo" }
```

I get the following JSON result of null arrays:

```json
{
  "schema": [
    { "name": "grad_year", "type": "long" },
    { "name": "name", "type": "string" },
    { "name": "gpa", "type": "float" }
  ],
  "datarows": [
    [ null, null, null ],
    [ null, null, null ],
    [ null, null, null ]
  ],
  "total": 3,
  "size": 3
}
```

However, if I change the PPL to use an alias that happens to have the same name as the original field:

```json
{ "query" : "source=students | TRENDLINE SMA(1, gpa) as gpa" }
```

I get the data back correctly for one of the array elements in each row.

Is it correct that `ProjectOperator` does not use the schema from its input?
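
For context, a minimal sketch of the "schema built only from the computations list" idea described above. The `Column` and `Computation` types below are simplified, hypothetical stand-ins rather than the plugin's actual schema classes, so treat this purely as an illustration of the intent:

```java
import java.util.List;

// Simplified sketch: Column and Computation are hypothetical stand-ins,
// not the SQL plugin's real schema classes.
final class TrendlineSchemaSketch {
  record Column(String name, String type) {}
  record Computation(String sourceField, String alias, String resultType) {}

  // Build the output schema purely from the trendline computations,
  // e.g. SMA(1, gpa) as foo -> a single "foo" column.
  static List<Column> schemaFromComputations(List<Computation> computations) {
    return computations.stream()
        .map(c -> new Column(c.alias(), c.resultType()))
        .toList();
  }

  public static void main(String[] args) {
    System.out.println(
        schemaFromComputations(List.of(new Computation("gpa", "foo", "double"))));
    // [Column[name=foo, type=double]]
  }
}
```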

jduo avatar Oct 14 '24 20:10 jduo

> Is it correct that ProjectOperator does not use the schema from its input?

I would expect the only field out of this schema to be the one computation in trendline ("foo"), rather than all 3 fields in the real index, but perhaps I'm mistaken here.

jduo avatar Oct 14 '24 21:10 jduo

> Is it correct that ProjectOperator does not use the schema from its input?

@vamsi-amazon @penghuo, can you please verify?

YANG-DB avatar Oct 17 '24 19:10 YANG-DB

> Is it correct that ProjectOperator does not use the schema from its input?

Possible design for trendline output schema (see the sketch after this list):

  1. If a field in the input is not used by any trendline computation, it shows up unaltered.
  2. If a field is used in a trendline computation and the computation's alias is the same as the field name, the field is replaced with the trendline computation.
  3. If a field is used in a trendline computation and the alias differs from the field name, the computation shows up as a new field in the result.
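
For concreteness, a small self-contained sketch of those three rules. The `Column`/`Computation` types are hypothetical stand-ins for illustration, not the PR's actual operator code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration of the three output-schema rules; types are hypothetical stand-ins.
final class TrendlineOutputSchemaSketch {
  record Column(String name, String type) {}
  record Computation(String sourceField, String alias, String resultType) {}

  static List<Column> outputColumns(List<Column> input, List<Computation> computations) {
    // Index computations by alias so input fields can be matched against them.
    Map<String, Computation> byAlias = new LinkedHashMap<>();
    computations.forEach(c -> byAlias.put(c.alias(), c));

    List<Column> out = new ArrayList<>();
    for (Column col : input) {
      Computation replacement = byAlias.remove(col.name());
      if (replacement != null) {
        out.add(new Column(replacement.alias(), replacement.resultType())); // rule 2: replaced
      } else {
        out.add(col); // rule 1: passes through unaltered
      }
    }
    // Rule 3: computations whose alias is a new name are appended as new fields.
    byAlias.values().forEach(c -> out.add(new Column(c.alias(), c.resultType())));
    return out;
  }

  public static void main(String[] args) {
    List<Column> students = List.of(
        new Column("name", "string"), new Column("gpa", "float"), new Column("grad_year", "long"));
    // TRENDLINE SMA(1, gpa) as foo -> name, gpa, grad_year, foo
    System.out.println(outputColumns(students, List.of(new Computation("gpa", "foo", "double"))));
    // TRENDLINE SMA(1, gpa) as gpa -> name, gpa (now the SMA), grad_year
    System.out.println(outputColumns(students, List.of(new Computation("gpa", "gpa", "double"))));
  }
}
```

With the `students` example above, `TRENDLINE SMA(1, gpa) as foo` would keep all three input fields and append `foo`, while `... as gpa` would keep three fields with `gpa` replaced by its moving average.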

jduo avatar Oct 17 '24 23:10 jduo

Requesting reviews from @LantaoJin @MaxKsyunz Thanks

jduo avatar Oct 23 '24 20:10 jduo

> Possible design for trendline output schema: …

@jduo did you manage to review the Spark trendline PR?

@jduo yes, I think it makes sense... @penghuo @dai-chen?

YANG-DB avatar Oct 24 '24 16:10 YANG-DB

> @jduo did you manage to review the Spark trendline PR?
>
> @jduo yes, I think it makes sense... @penghuo @dai-chen?

@YANG-DB, I used the PPL parser code from the Spark PR. The schema semantics seem to be the same AFAIK, but I haven't tried the Spark one out. The same goes for the handling of results when there aren't enough samples (returning NULL). @kt-eliatra?

jduo avatar Oct 24 '24 16:10 jduo

The majority of the implementation is done. There's some more work left to support datetime types. Only simple moving average is implemented, not weighted moving average.
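
For reference, the simple moving average is just the unweighted mean over a fixed-size window of the most recent values. A standalone sketch (not the PR's actual operator code) that, like the behaviour discussed above, yields NULL until the window has enough samples:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Standalone illustration of a simple (unweighted) moving average.
final class SimpleMovingAverageSketch {
  static List<Double> sma(List<Double> values, int window) {
    Deque<Double> buffer = new ArrayDeque<>();
    List<Double> out = new ArrayList<>();
    double sum = 0.0;
    for (double v : values) {
      buffer.addLast(v);
      sum += v;
      if (buffer.size() > window) {
        sum -= buffer.removeFirst(); // slide the window forward
      }
      if (buffer.size() == window) {
        out.add(sum / window);
      } else {
        out.add(null); // not enough samples yet
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // SMA with window 2 over 3.0, 4.0, 5.0 -> [null, 3.5, 4.5]
    System.out.println(sma(List.of(3.0, 4.0, 5.0), 2));
  }
}
```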

jduo avatar Oct 27 '24 16:10 jduo

Datetime support has been added, so this is effectively code complete (only simple moving average is supported in this iteration).
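
One plausible way to average datetime values for the moving average (a guess at the general approach, not necessarily what this PR does internally) is to average their epoch representation and convert back:

```java
import java.time.Instant;
import java.util.List;

// Illustration only: averaging timestamps via their epoch-millisecond values.
final class DatetimeAverageSketch {
  static Instant average(List<Instant> instants) {
    long sum = 0;
    for (Instant i : instants) {
      sum += i.toEpochMilli();
    }
    return Instant.ofEpochMilli(sum / instants.size());
  }

  public static void main(String[] args) {
    System.out.println(average(List.of(
        Instant.parse("2024-01-01T00:00:00Z"), Instant.parse("2024-01-03T00:00:00Z"))));
    // 2024-01-02T00:00:00Z
  }
}
```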

jduo avatar Oct 28 '24 23:10 jduo

@MaxKsyunz @Yury-Fridlyand

jduo avatar Oct 29 '24 19:10 jduo

This functionality hasn't been tested with pagination.

jduo avatar Oct 30 '24 17:10 jduo

There are syntax changes in the Spark version of this command that haven't been reflected here yet (opensearch-spark #833):

  • Alias has been made optional.
  • There is now an optional sort specification.

I think we could support the latter here by having the Analyzer return a LogicalSort on top of the LogicalTrendline when the sort clause is present (rough sketch at the end of this comment).

It's not clear to me in the Spark version what the field name should be if an alias is not specified. @YANG-DB @salyh @kt-eliatra
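
To illustrate the LogicalSort-on-top-of-LogicalTrendline idea, a rough sketch with stand-in types. The names LogicalTrendline and LogicalSort come from the discussion above, but the real Analyzer and logical-plan classes have different shapes, so treat everything below as an assumption:

```java
import java.util.Optional;

final class TrendlineAnalyzerSketch {
  // Stand-in plan nodes for illustration only; not the plugin's real classes.
  interface LogicalPlan {}
  record LogicalTrendline(LogicalPlan child, String computation) implements LogicalPlan {}
  record LogicalSort(LogicalPlan child, String sortField) implements LogicalPlan {}

  // If the trendline command carries a sort clause, wrap the LogicalTrendline
  // in a LogicalSort; otherwise return the trendline node unchanged.
  static LogicalPlan analyzeTrendline(
      LogicalPlan child, String computation, Optional<String> sortField) {
    LogicalPlan trendline = new LogicalTrendline(child, computation);
    return sortField.<LogicalPlan>map(f -> new LogicalSort(trendline, f)).orElse(trendline);
  }

  public static void main(String[] args) {
    LogicalPlan scan = new LogicalPlan() {}; // pretend relation/scan node
    System.out.println(analyzeTrendline(scan, "SMA(2, gpa) as gpa_sma", Optional.of("gpa")));
  }
}
```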

jduo avatar Oct 30 '24 19:10 jduo

Any insight on what could trigger the bwc (backward compatibility) test failures here?

jduo avatar Oct 30 '24 20:10 jduo

> It's not clear to me in the Spark version what the field name should be if an alias is not specified.

https://github.com/opensearch-project/opensearch-spark/pull/833#discussion_r1823003234

salyh avatar Oct 30 '24 21:10 salyh

@salyh @YANG-DB, I have added the sort functionality and made the alias optional.

jduo avatar Oct 31 '24 21:10 jduo

Rebasing on upstream/main to see if the bwc test failures go away.

jduo avatar Oct 31 '24 23:10 jduo

@salyh, @YANG-DB, @LantaoJin, would you be able to review this PR? Since the Spark PR is finished, it might make sense to do this while it is still fresh.

jduo avatar Oct 31 '24 23:10 jduo

Codecov Report

Attention: Patch coverage is 99.04306% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.54%. Comparing base (5716cab) to head (8d7bc7c). Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...pensearch/sql/ppl/parser/AstExpressionBuilder.java | 81.81% | 1 Missing and 1 partial :warning: |
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##               main    #3071      +/-   ##
============================================
+ Coverage     94.49%   94.54%   +0.05%
- Complexity     5422     5458      +36
============================================
  Files           528      530       +2
  Lines         15450    15645     +195
  Branches       1025     1046      +21
============================================
+ Hits          14599    14792     +193
- Misses          804      805       +1
- Partials         47       48       +1
```

| Flag | Coverage Δ |
|---|---|
| sql-engine | 94.54% <99.04%> (+0.05%) :arrow_up: |

Flags with carried forward coverage won't be shown.


codecov[bot] avatar Nov 07 '24 19:11 codecov[bot]

@jduo can you please check why the CI build failed? Is it related to this PR or something generic?

YANG-DB avatar Dec 04 '24 00:12 YANG-DB

> @jduo can you please check why the CI build failed? Is it related to this PR or something generic?

There was a mistake with the field name in the docs. I've updated it now.

jduo avatar Dec 04 '24 00:12 jduo

> Thanks, LGTM! @jduo can you please check the failed CI tasks?

Looks like a merge error in the grammar (a semicolon was lost). I'll post an update shortly.

jduo avatar Dec 10 '24 19:12 jduo

@jduo LGTM! @Yury-Fridlyand @acarbonetto, can you please review?

YANG-DB avatar Dec 12 '24 18:12 YANG-DB

The linkchecker issue will be fixed here: https://github.com/opensearch-project/sql/pull/3193#pullrequestreview-2494140519

acarbonetto avatar Dec 12 '24 18:12 acarbonetto