greptimedb icon indicating copy to clipboard operation
greptimedb copied to clipboard

feat: support `WITHIN` filter

Open NiwakaDev opened this issue 11 months ago • 17 comments

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

Part of #3755.

This PR supports within syntax like:

select * from monitors where ts within '2024';

select * from monitors where ts within '2024-04-19';

select * from monitors where ts within '2024-04-19 23:50';

They are converted to

select * from monitors where ts >= '2024-01-01 00:00:00' and ts < '2025-01-01 00:00:00';

select * from monitors where ts >= '2024-04-19 00:00:00' and ts < '2024-04-20 00:00:00';

select * from monitors where ts >= '2024-04-19 23:50:00' and ts < '2024-04-19 23:51:00';

What's changed and what's your intention?

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • [ ] I have written the necessary rustdoc comments.
  • [ ] I have added the necessary unit tests and integration tests.
  • [ ] This PR requires documentation updates.
  • [ ] API changes are backward compatible.
  • [ ] Schema or data changes are backward compatible.

Summary by CodeRabbit

  • New Features
    • Enhanced query filtering now accepts year-only date inputs, automatically converting them into valid date ranges.
    • Expanded SQL functionality adds new aggregation and data selection options for analyzing system metrics.

  • Chores
    • Updated an internal dependency for improved stability.

  • Tests
    • Added new SQL queries and data insertion tests to validate the improved filtering and aggregation behaviors.

NiwakaDev avatar Jan 18 '25 21:01 NiwakaDev

[!IMPORTANT]

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request updates the sqlparser dependency in Cargo.toml and introduces new functionality for filtering operations. A new WithinFilterFunction is added to the math functions, and a within_filter module is integrated into the query engine. The planner now rewrites the logical plan with an asynchronous call to a range plan rewriter, and a new WithinFilterRule is applied to transform filter expressions. Additionally, test cases and SQL scripts are enhanced with new queries and assertions, reflecting these updates.

Changes

File(s) Change Summary
Cargo.toml Updated sqlparser dependency from the GreptimeTeam repository/revision to the NiwakaDev repository/revision with unchanged features
src/common/function/src/scalars/math.rs Added new WithinFilterFunction with its structure, trait implementations (name, return_type, signature, eval) for boolean output and placeholder error message
src/log-query/src/log_query.rs Modified test case in canonicalize: changed TimeFilter start date from "2023-10-01" to "2023"
src/query/src/{lib.rs, planner.rs, query_engine/state.rs, within_filter.rs} Introduced within_filter module; added a new RangePlanRewriter step in plan_sql; updated QueryEngineState extension_rules with new WithinFilterRule; added WithinFilterRule implementation that transforms filter expressions
tests/cases/standalone/common/{basic.result, basic.sql} Added new SQL queries for selecting, aggregating, and inserting data on the system_metrics table, including filtering by a within_filter predicate

Sequence Diagram(s)

sequenceDiagram
    participant Planner as DfLogicalPlanner
    participant RangeRewriter as RangePlanRewriter
    participant FilterRule as WithinFilterRule
    participant Optimizer as Query Optimizer

    Planner->>Planner: Convert SQL to initial plan (sql_to_rel)
    Planner->>RangeRewriter: Await rewrite(original_plan)
    RangeRewriter->>FilterRule: Analyze LogicalPlan for within_filter predicate
    FilterRule-->>RangeRewriter: Return transformed LogicalPlan
    RangeRewriter-->>Planner: Provide rewritten LogicalPlan
    Planner->>Optimizer: Optimize LogicalPlan
    Optimizer-->>Planner: Return optimized LogicalPlan

Suggested reviewers

  • waynexia
  • zhongzc
  • tisonkun

Poem

I'm just a little rabbit, hopping through the code,
Changes bloom like carrots along my winding road.
With within_filter and plans that twist and turn,
I nibble on new features, ever eager to learn.
Coding carrots and crisp SQL—oh, what a joyful spree!


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

coderabbitai[bot] avatar Jan 18 '25 21:01 coderabbitai[bot]

@killme2008 @waynexia I've created a rough implementation of the WITHIN filter:

My idea is

  1. Parse WITHIN filters, and treats it as functions in the parser phase(https://github.com/NiwakaDev/sqlparser-rs/commit/b3584823b236db3c244b70945008f0aae65c5e39).
  2. Convert WITHIN filter functions to normal filters in the planner phase.

This idea is based on RANGE query implementation. If the approach seems good, I'd like to proceed with it. What do you think?

NiwakaDev avatar Jan 18 '25 21:01 NiwakaDev

@coderabbitai review

killme2008 avatar Feb 07 '25 06:02 killme2008

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch PR details
gh pr view 5397 --json files,body,title,commits

# Search for test files related to WITHIN filter
rg -l "WITHIN" --type rust

# Check for documentation updates
fd -e md -e rs | xargs rg -l "WITHIN"

Length of output: 2304


🌐 Web query:

What is WITHIN filter in time series databases?

💡 Result:

While none of the provided sources explicitly mention a "WITHIN" filter operator, time series databases commonly use three equivalent filtering mechanisms that achieve similar spatial/temporal containment checks:

## Temporal Range Filtering
Used to select entries within a specific time window:
```sql
-- RavenDB example with BETWEEN
SELECT timeseries FROM HeartRates 
BETWEEN '2020-05-17T00:00:00' AND '2020-05-17T00:10:00' [1]

-- QuestDB's IN syntax for time containment
SELECT * FROM trades 
WHERE timestamp IN '2023-12-20;15s' -- 15-second window [5]
```

## Value Range Filtering
Filters numerical values within specified bounds:
```sql
-- InfluxQL example excluding outliers 
WHERE (154 - "value") BETWEEN 0 AND 760 [3]

-- RavenDB value threshold filtering
WHERE ts.Value > 75 [1]
```

## Tag Containment Filtering
Checks if tags exist in a specified set:
```csharp
// RavenDB tag whitelisting
.Where(ts => ts.Tag == "watches/fitbit" || 
             ts.Tag == "watches/apple") [1]
```

These patterns align with the core filtering capabilities described across sources:
1. Temporal containment through date ranges (`BETWEEN`, `IN`) [1][5]
2. Numerical range checks using comparison operators [1][3]
3. Set membership checks for tags/attributes [1][4]

While terminology varies between implementations (BETWEEN, IN, WHERE clauses), they all enable selecting data points that fall **within** specified temporal, numerical, or categorical boundaries - essentially implementing different flavors of spatial/temporal containment filters.

Citations:

- 1: https://ravendb.net/docs/article-page/6.2/csharp/document-extensions/timeseries/querying/filtering
- 2: https://docs.datomic.com/reference/filters.html
- 3: https://community.grafana.com/t/how-to-filter-noise-from-time-series-graph/108812
- 4: https://www.alibabacloud.com/blog/key-concepts-and-features-of-time-series-databases_594734
- 5: https://questdb.com/glossary/time-series-database/
- 6: https://devops.com/time-series-database-basics/
- 7: https://www.splunk.com/en_us/blog/learn/time-series-databases.html

🏁 Script executed:

#!/bin/bash
# Check the new within_filter implementation
cat src/query/src/within_filter.rs

# Check test cases
cat tests/cases/standalone/common/basic.sql
cat tests/cases/standalone/common/basic.result

# Check planner integration
cat src/query/src/planner.rs

Length of output: 20487


@killme2008 Here's my review of the WITHIN filter implementation:

  1. Documentation:

    • Please add rustdoc comments for the WithinFilterRule struct and its implementation
    • Document the supported date formats and examples in the comments
  2. Error Handling:

    • Replace todo!() macro with proper error handling for unsupported cases
    • Consider adding custom error types for invalid date formats
    • Add validation for the timestamp column type
  3. Date Format Support:

    • Current implementation only supports year format ('2022')
    • Consider supporting more formats like:
      • Full date: '2022-11-03'
      • Month: '2022-11'
      • Time ranges: '1h', '7d'
  4. Testing:

    • Add more test cases covering:
      • Different date formats
      • Invalid inputs
      • Edge cases (year boundaries)
      • Error scenarios

The approach of converting WITHIN to range filters looks good, but needs more work on documentation, error handling, and format support. Would you like me to provide specific code examples for any of these points?

:white_check_mark: Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai[bot] avatar Feb 07 '25 06:02 coderabbitai[bot]

@killme2008 @waynexia I've created a rough implementation of the WITHIN filter:

My idea is

  1. Parse WITHIN filters, and treats it as functions in the parser phase(NiwakaDev/sqlparser-rs@b358482).
  2. Convert WITHIN filter functions to normal filters in the planner phase.

This idea is based on RANGE query implementation. If the approach seems good, I'd like to proceed with it. What do you think?

I think it's great! Let's do it!

killme2008 avatar Feb 07 '25 06:02 killme2008

@killme2008 Marked this PR as review.

NiwakaDev avatar Feb 09 '25 12:02 NiwakaDev

Great work! I plan to review this today

waynexia avatar Feb 12 '25 19:02 waynexia

@NiwakaDev Hi, do you have time to resolve these comments? I think this PR is almost done, looks great, thank you.

killme2008 avatar Feb 19 '25 02:02 killme2008

@killme2008

do you have time to resolve these comments?

Sorry for the delay. I'll address the reviews this weekend.

NiwakaDev avatar Feb 19 '25 03:02 NiwakaDev

I have a few concerns about adding a new within grammar. It brings more burden of maintaining the sqlparser lib, and is not generic enough (I did some search and see no one else provides similar things).

To achieve a similar goal, @sunng87 provides another alternative method that implements it as a within() function like where within(ts, '2024') or where within(ts, '2024-04-19 23:50'). This is fully compatible with the current sql grammar and people's usual practice.

Furthermore, we can support relative time filters, like where within(ts, '5d'::INTEVAL), or arbitrary time ranges where within(ts, '2024-02-24', '2024-03-15').

But WITHIN is already a keyword, so we need to pick up another name. E.g., with_in, in_range, time_filter etc. I don't have a good choice in my mind, open to suggestions.

Really sorry to put this comment that late 🙇 It just came across to my mind on the drive

waynexia avatar Feb 24 '25 19:02 waynexia

@waynexia

I did some search and see no one else provides similar things

It seems like questdb supports a similar feature: https://questdb.com/docs/reference/sql/where/#time-range-where-in. But it might not be generic.

One thing I’m wondering:

if we implement it as a function, is it pushed down like a normal filter?

NiwakaDev avatar Feb 24 '25 23:02 NiwakaDev

I have a few concerns about adding a new within grammar. It brings more burden of maintaining the sqlparser lib, and is not generic enough (I did some search and see no one else provides similar things).

To achieve a similar goal, @sunng87 provides another alternative method that implements it as a within() function like where within(ts, '2024') or where within(ts, '2024-04-19 23:50'). This is fully compatible with the current sql grammar and people's usual practice.

Furthermore, we can support relative time filters, like where within(ts, '5d'::INTEVAL), or arbitrary time ranges where within(ts, '2024-02-24', '2024-03-15').

But WITHIN is already a keyword, so we need to pick up another name. E.g., with_in, in_range, time_filter etc. I don't have a good choice in my mind, open to suggestions.

Really sorry to put this comment that late 🙇 It just came across to my mind on the drive

If you look at the PR in the parser https://github.com/GreptimeTeam/sqlparser-rs/pull/16,

You will find that it doesn't change the existing SQL parser code but adds new parsing logic. So maintenance isn't a problem here at all.

This grammar is supported by QuestDB. I think it's friendlier than a SQL function. And of course, we can support the interval by extending in option:

select * from monitors where ts within '2024-04-19' in '1 day'

Like that.

killme2008 avatar Feb 25 '25 12:02 killme2008

if we implement it as a function, is it pushed down like a normal filter?

Sure, we can do this just as other functions or operators

You will find that it doesn't change the existing SQL parser code but adds new parsing logic. So maintenance isn't a problem here at all.

You could have a try yourself before you put "not a problem at all" :) RANGE clause doesn't change existing code either.

I think it's friendlier than a SQL function. And of course, we can support the interval by extending in option

Do you really think the next example is easy to understand?

waynexia avatar Feb 25 '25 22:02 waynexia

The range query is worth the maintenance cost due to its usefulness and elegance, but I agree with you that within is unnecessary. @waynexia

Could you refactor this PR, @NiwakaDev? Please implement it as an SQL function. Most of the work is already done. Thanks!

killme2008 avatar Feb 26 '25 03:02 killme2008

@killme2008 @waynexia

Please implement it as an SQL function. Most of the work is already done. Thanks!

The new following approach:

  1. Registers WITHIN function.
  2. Converts WITHIN function to a normal filter in the planner phase.

I guess the second step is almost the same. What do you think?

NiwakaDev avatar Feb 26 '25 03:02 NiwakaDev

@killme2008 @waynexia

Please implement it as an SQL function. Most of the work is already done. Thanks!

The new following approach:

  1. Registers WITHIN function.
  2. Converts WITHIN function to a normal filter in the planner phase.

I guess the second step is almost the same. What do you think?

Yes, definitely.

killme2008 avatar Feb 26 '25 03:02 killme2008

Thank you @NiwakaDev @killme2008 !

waynexia avatar Feb 26 '25 18:02 waynexia

@NiwakaDev Sorry to disturb you, but do you have time to review this PR? It's nearly finished and would benefit our users. If not, I would love to help.

killme2008 avatar Apr 07 '25 09:04 killme2008

@killme2008

Sorry for the delay. March was busy😅

I can finally start working on this PR this week if you don't mind. If I can't make good progress by next week, I'll ask for your help.

Thank you!

NiwakaDev avatar Apr 08 '25 03:04 NiwakaDev

@killme2008

Sorry for the delay. March was busy😅

I can finally start working on this PR this week if you don't mind. If I can't make good progress by next week, I'll ask for your help.

Thank you!

That would be great! Thank you.

killme2008 avatar Apr 08 '25 11:04 killme2008

Close it due to lack of progress; it can be reopened if there are any updates.

fengjiachun avatar Aug 13 '25 06:08 fengjiachun