ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat(risingwave): add streaming DDLs

Open KeXiangWang opened this issue 1 year ago • 3 comments

Description of changes

Risingwave plays a role as a computing engine and storage system in Streaming ecosystems. Usually, a streaming workload will include two/three systems like this: upstream data source --> Risingwave Or, upstream data source --> Risingwave --> downstream data sink

This PR adds streaming support, mainly new relations related to streaming, for the Risingwave backend.

To be specific, Four types of new relations are introduced:

  • Source, data sources that encompass a connector connected to an upstream data system like Kafka. A source works like an extraction or a placeholder for an upstream system, and although it has columns, it does not store any data itself.
  • Materialized View, which is Risingwave's core concept for streaming. One can create an MV with a query on existing tables or sources. The data in MV is automatically updated in a real-time way. Users can access the data with a select statement just like accessing a table.
  • Table with connector, works like a combination of a source and a normal table. The difference between Table with connector and Source is that, once the table is created, it will automatically start to consume data from upstream systems and update it into Risingwave.
  • Sink, unlink a Materialized View which stores the result of a query in RW, users can choose to sink the result to a downstream system, e.g. Redis, and then read the data in the downstream system. Unlike an MV, User cannot access the data directly from a sink.

Some minor changes:

  1. Pump the image version to a new nightly build, as the new image solves some issues found in the previous PR.
  2. Pump the sqlalchemy-risingwave version from 1.0.0 to 1.0.1, some implementations are updated.
  3. Mark one test, test_semi_join, as skipped for risingwave backend. The test sometimes stuck and it seems to be Risingwave's fault. I'll continue to investigate.

One issue introduced in this PR: Risingwave backend's implementation is sqlalchemy-based. But sqlalchemy has no corresponding concept for sources and sinks. So, in order to work around this issue, I temporarily categorize a source as a view. So sqlalchemy can access a source's metadata (names and types of its columns) like a view. However, this causes a side effect, when users call list_view, they may see some unexpected views that are actually sources. This can be fixed by rewriting the implementation of the list_view func in the Risingwave Backend. I'll fix it later.

Issues closed

KeXiangWang avatar Feb 06 '24 02:02 KeXiangWang

It seems I have no access to invite reviewers. @gforsyth @cpcloud @chloeh13q @deepyaman Could you please help take a review or invite the appropriate ones?

KeXiangWang avatar Feb 06 '24 03:02 KeXiangWang

Hi @KeXiangWang - thanks for opening this!

We're unfortunately in the midst of a large internals refactor moving all backends to use sqlglot instead of sqlalchemy. We hope to merge https://github.com/ibis-project/ibis/tree/the-epic-split into main soon (maybe by end of week?) after which we'll be back to accepting PRs to main. Until that's done though we don't really want to merge additional large PRs as this would complicate getting the refactor in. We'll follow up after the-epic-split is merged to discuss how best to implement the work here using sqlglot, but until then things are blocked on us finishing up the internals refactor.

jcrist avatar Feb 06 '24 20:02 jcrist

We're unfortunately in the midst of a large internals refactor moving all backends to use sqlglot instead of sqlalchemy.

It's OK. I have no experience on sqlplot. After the PR is merged, could your engineers please also help refactor the related codes in this PR? I'd like to provide help if needed.

KeXiangWang avatar Feb 06 '24 20:02 KeXiangWang

@chloeh13q @cpcloud: bump this up, is this something we can help wth getting the sqlglot refactoring?

zhenzhongxu avatar Feb 23 '24 06:02 zhenzhongxu

It looks like the first order of business would be to rebase this PR on main. @KeXiangWang Can you do that?

cpcloud avatar Feb 23 '24 15:02 cpcloud

It looks like the first order of business would be to rebase this PR on main.

OK, I'll try my best.

KeXiangWang avatar Feb 23 '24 18:02 KeXiangWang

Hi @cpcloud, I've rebased the PR to newest main and fix all the issue introduced by the sqlglot refactoring. Could you please help take a look.

Besides, I find some tests are unable to pass even with the main branch, so I leave them unchanged for now. Do you have any ideas? Is there any configuration I missed?

KeXiangWang avatar Feb 28 '24 22:02 KeXiangWang

I have some open questions on a high level:

I understand the motivation behind introducing separate APIs for sources, materialized views, tables, and sinks for the RW backend. Sources, materialized views, tables, and sinks are distinct objects in RW. However, I wonder how generalizable this is across different streaming backends.

For the user who is already familiar with RW's API, this implementation feels natural. But for the user who is coming from another backend, it may cause some confusion.

For example, Flink also has the same concepts, but it doesn't use a different API for each. Source, tables, and sinks are all created with CREATE TABLE. A lot of these underlying differences are abstracted away from the user.

Of course, it's also okay that we just do this for the RW backend for now and refactor at a later point if there is a need to consolidate some of the APIs.

chloeh13q avatar Mar 06 '24 21:03 chloeh13q

I wonder how generalizable this is across different streaming backends.

Good question. These concepts are common in streaming backends, although different systems may have different names for them. For example, as Flink doesn't have a normal persistent table in a traditional database, its tables are used to express a streaming job. While in RW, Materialize, and Timeplus, we use materialized views. For sources, Flink combines a streaming job and the dependent sources together as a Table, while RW and materialize use sources objects and Timeplus use stream objects.

Flink also has the same concepts, but it doesn't use a different API for each.

Flink positions itself as a 'compute engine,' which means it's designed not to store data but to process it. In Flink, a table is essentially a streaming job that interacts with sources or sinks. The outcomes of these jobs are not stored within Flink itself. In contrast, streaming databases such as RW, Materialize, and Timeplus, function as databases that store the results locally. While Flink tables are declared as Tables, you cannot directly query data from them. Instead, you must route the data to an external system to access the results. With Flink, you trade off a simpler API for the flexibility and simplicity of its architecture. On the other hand, streaming databases offer the ability to customize your streaming jobs by combining various objects to suit your needs, providing direct accessing or sinking to external systems at the same time.

KeXiangWang avatar Mar 11 '24 06:03 KeXiangWang

@KeXiangWang Yep that makes sense. Since Ibis works across engines & databases, I'd imagine that these are questions that Ibis will need to address. But I think it also depends on the background of the user, e.g., whether it's someone who's experimenting with streaming or someone who already. has some expertise in streaming and wants to try out the Ibis API. In any case, these may be questions that we can leave open and come back to, when there is user feedback backing up one way or another.

chloeh13q avatar Mar 20 '24 23:03 chloeh13q

@KeXiangWang -- we've just merged in #8655 which moves us away from using the word schema in any hierarchical sense. We're standardizing on "database" as a collection of tables and "catalog" as a collection of database. If this PR allows maintainers to push updates, I can take on porting over your work here to the new naming and push that up.

gforsyth avatar Mar 26 '24 16:03 gforsyth

If this PR allows maintainers to push updates, I can take on porting over your work here to the new naming and push that up.

Thx @gforsyth . I really cannot fiind a button in this page to enable it, so I create a new PR here. We can push the following updates to the new PR. Thanks!

KeXiangWang avatar Mar 26 '24 16:03 KeXiangWang

I've pushed up the fixes to #8781 -- @KeXiangWang, you can either cherry-pick that commit back to this PR, or we can close this PR out in favor of your newer one, whichever you prefer!

gforsyth avatar Mar 26 '24 16:03 gforsyth

Hi @gforsyth . Thanks for the effort. I've update the patch to this PR and I'll close https://github.com/ibis-project/ibis/pull/8781 when this PR is merged. Thanks!

KeXiangWang avatar Mar 29 '24 15:03 KeXiangWang

@cpcloud @gforsyth Any further comments? Can we merge this PR now?

KeXiangWang avatar Mar 31 '24 11:03 KeXiangWang