greptimedb icon indicating copy to clipboard operation
greptimedb copied to clipboard

Support setting time range in `Copy From` statement

Open MichaelScofield opened this issue 11 months ago • 10 comments

What type of enhancement is this?

User experience

What does the enhancement do?

As title.

Currently there's timestamp_range: Option<TimestampRange> field in the struct CopyTableRequest, but it's not set. We can set it from the SQL parser first, then check if it works.

The copy from in postgresql supports "where condition" (https://www.postgresql.org/docs/current/sql-copy.html), maybe we can do that, too. For this issue, we can extract the time range from the "where condition", and leave more general filtering in the future.

Implementation challenges

No response

MichaelScofield avatar Mar 14 '24 05:03 MichaelScofield

hi @MichaelScofield ,i want to take this, would you give me some further advice?

naosense avatar Mar 19 '24 22:03 naosense

@naosense Sure. The "copy from" statement is parsed in file "copy_parser.rs", you can start from there. First you might want to see how "where" can be parsed in sqlparser. If lucky, the "copy from" statement from sqlparser may already carry the "where" part. Then you need to find how to extract the desired "time range" from the filters in "where". Finally set the time range in the struct CopyTableRequest, do some test to see if it works!

MichaelScofield avatar Mar 20 '24 12:03 MichaelScofield

hi @MichaelScofield ,i want to take this, would you give me some further advice?

I think we can add a where clause for the Copy From statement, then forward the filter of the where clause to the scan request(The Copy From statement sends a scan request to retrieve data).

WenyXu avatar Mar 20 '24 14:03 WenyXu

After my initial research, I do not have much idea about solving this issue, probably because I am not familiar with the greptimedb and sqlparser. I hope someone can take over, I'm afraid this issue is beyond my ability😅.

naosense avatar Mar 22 '24 14:03 naosense

I spent quite a while trying to solve this problem. There has been a small progress, but since I am a newbie in rust, I find it difficult for me to solve rust syntax problems in a short time, such as error handling, type conversion, and syntax sugar. So I decided to post my progress here, welcome to discuss ideas.

First I successfully parsed the where clause and got the expression. As a rookie, it took me a while to get if-else to return the same type of result...

let predicate = if self.parser.parse_keyword(Keyword::WHERE) {
            Some(self
                .parser
                .parse_expr()
                .context(error::SyntaxSnafu)?;)
} else {
            None
};

Then I found that the returned expression did not meet the needs of the issue. What needs to be returned here is timestampRange. So, the next step is to convert the expression in sql-parser into timestampRange. I think what can be done is to:

  1. check if the type of the expression is AnyOp or AllOp
  2. check if the data type is timestamp
  3. handle various >, < situations
  4. ...

okJiang avatar Mar 23 '24 14:03 okJiang

@naosense that's ok, feel free to take another good first issue!

MichaelScofield avatar Mar 25 '24 03:03 MichaelScofield

@okJiang we can make things a lot simpler by restricting the allowed syntax here. The filters in copy from stmt only take one of the following 3 forms:

  • where <F> and <F>
  • where <F> or <F>
  • where <F>

where F = <column> > | >= | < | <= | = <timestamp literal value>

MichaelScofield avatar Mar 25 '24 03:03 MichaelScofield

I am trying to solve this problem. It seems that I need to pass arguments to fn to_copy_table_request and I can

  1. covert string to timestamp with timezone
  2. check the specified columns is correct.

I can't find a elegant way to pass the parsed arugments.

pub type WhereClause = Option<(
    Option<(ObjectName, ParserBinaryOperator, String)>,
    Option<ParserBinaryOperator>,
    Option<(ObjectName, ParserBinaryOperator, String)>,
)>;

Should I simply pass Expr to the to_copy_table_request function and parse it again, or are there some methods to make the WhereClause more elegant

kysshsy avatar May 22 '24 00:05 kysshsy

Should I simply pass Expr to the to_copy_table_request function and parse it again, or are there some methods to make the WhereClause more elegant

@kysshsy For simplicity, what about reusing the time range option? https://github.com/GreptimeTeam/greptimedb/blob/a3a2c8d06340a296c72de79016ecbd113d4f8954/src/operator/src/statement.rs#L335

Then we use the time range to filter output record batches from files.

If we'd like to support WHERE, maybe we should provide full capabilities of the where expression. We could create a new issue for WHERE support in COPY statement.

evenyag avatar May 24 '24 03:05 evenyag

@kysshsy Since we've marked this issue as GFI, I think we can resort to a simple solution. We don't need to support complex time ranges expressed by where clause, but to reuse the time range in WITH options, which is already in CopyTableRequest:

https://github.com/GreptimeTeam/greptimedb/blob/090b59e8d6406397eb1668dc6c2e4c68e914b664/src/table/src/requests.rs#L231

This will solve 90% use cases.

Once we've built a record batch stream via build_read_stream: https://github.com/GreptimeTeam/greptimedb/blob/f93b5b19f05bde852b08366f52f0cf30d09c1c96/src/operator/src/statement/copy_table_from.rs#L208

we can evaluate the batches yielded just like here: https://github.com/GreptimeTeam/greptimedb/blob/40c585890a3cb231dacb5b7cdf1496a06e581826/src/operator/src/statement/copy_table_to.rs#L104-L135

v0y4g3r avatar May 24 '24 07:05 v0y4g3r